Posters - Schedules

Monday, July 11 and Tuesday, July 12 between 12:30 PM CDT and 2:30 PM CDT
Wednesday, July 13 between 12:30 PM CDT and 2:30 PM CDT
Session A Poster Set-up and Dismantle
Session A Posters set up:
Monday, July 11 between 7:30 AM CDT and 10:00 AM CDT
Session A Posters dismantle:
Tuesday, July 12 at 6:00 PM CDT
Session B Poster Set-up and Dismantle
Session B Posters set up:
Wednesday, July 13 between 7:30 AM CDT and 10:00 AM CDT
Session B Posters dismantle:
Thursday, July 14 at 2:00 PM CDT
Virtual: Analysis of EccD3, ESX3 secretion system component as Mycobacterium tuberculosis drug target
COSI: GenCompBio
  • Ana Laura Granados-Tristán, Universidad Autónoma de Nuevo León, Mexico
  • Mauricio Carrillo-Tripp, Centro de investigación y de Estudios Avanzados del Instituto Politécnico Nacional Unidad Monterrey, Mexico
  • Carlos Eduardo Hernández-Luna, Universidad Autónoma de Nuevo León, Mexico
  • Aldo Herrera-Rodulfo, Centro de investigación y de Estudios Avanzados del Instituto Politécnico Nacional Unidad Monterrey, Mexico
  • Laura Adiene Gonzalez-Escalante, Centro de Investigación Biomédica del Noreste del IMSS, Mexico
  • Beatriz Silva-Ramirez, Centro de Investigación Biomédica del Noreste del IMSS, Mexico
  • Brenda Leticia Escobedo-Guajardo, Centro de Investigación Biomédica del Noreste del IMSS, Mexico
  • Mario Bermúdez de León, Centro de Investigación Biomédica del Noreste del IMSS, Mexico
  • Katia Peñuelas-Urquides, Centro de Investigación Biomédica del Noreste del IMSS, Mexico


Presentation Overview:

Drug-resistant tuberculosis (DR-TB) is a global health problem that requires the development of new drugs and the identification of novel therapeutic targets. The ESX3 secretion system is essential to Mycobacterium tuberculosis survival and virulence. This system comprises the EccB3, EccC3, EccD3, and EccE3 proteins. The aim of this study was to evaluate the EccD3 protein as a drug target. The 3D structure of the M. tuberculosis EccD3 protein was predicted by homology modeling in SWISS-MODEL, using Mycolicibacterium smegmatis structures as templates. We selected 34 antituberculosis drugs and obtained their 3D structures from PubChem. Protein-drug interactions were evaluated by molecular docking using AutoDock Vina. Biological activities, ADME properties and toxicity were obtained from online web servers. A high-quality structure of M. tuberculosis EccD3 was obtained. Two potential sites at the dimer interfaces, predicted to destabilize the EccD3 structure, were identified and selected as drug targets. The strongest binding to the EccD3 dimer interfaces was observed for moxidectin and selamectin, with binding free energies of -8.4 and -7.4 kcal/mol, respectively. We found in silico interactions of moxidectin and selamectin with the EccD3 interfaces, and these interactions may alter the dimer structure of EccD3. The EccD3 protein has potential as a target for moxidectin and selamectin.

Virtual: Applying metagenomics with long read sequencing data as diagnostic tool for infectious disease
COSI: GenCompBio
  • Antonio Mauro Rezende, Institute of Tropical Medicine, Belgium
  • Tessa de Block, Institute of Tropical Medicine, Belgium
  • Marjan Van Esbroeck, Institute of Tropical Medicine, Belgium
  • Kevin Arien, Institute of Tropical Medicine, Belgium
  • Philippe Selhorst, Institute of Tropical Medicine, Belgium
  • Koen Vercauteren, Institute of Tropical Medicine, Belgium


Presentation Overview:

Laboratory tests help clinicians identify the etiology of an infection. State-of-the-art tests (e.g. antigen tests, PCR, serology) are designed to detect specific pathogens or pathogen exposure, and hence require prior pathogen suspicion based on symptoms and geography, which often overlap between (tropical) infectious diseases. As a result, pathogens not tested for and newly emerging pathogens will be missed. Metagenomic sequencing provides an attractive alternative to traditional pathogen-specific testing as it allows for the sequencing of all genomic material present in the patient sample. Here, we develop a wet lab protocol for pathogen non-specific enrichment from serum samples, coupled to Nanopore sequencing and a fully automated computational pipeline, as a diagnostic approach for viral infectious disease. The pipeline has been developed in the Nextflow environment along with several containers (Docker or Singularity) to allow easy distribution. The procedure has been tested with serum samples spiked with different culture-derived viruses (CHIKV, ZIKV, DENV, SARS-CoV-2) as well as a clinical sample from a CHIKV-infected patient. Different virus concentrations have been assessed, reflecting the Ct value distribution found in routine clinical practice (25-35). In conclusion, we were able to recover the complete genome of the pathogen even at the highest Ct tested.

Virtual: Characterization of circRNAs in neuroblastoma
COSI: GenCompBio
  • Md. Tofazzal Hossain, Shenzhen Institutes of Advanced Technology, CAS, China
  • Jingjing Zhang, Shenzhen Institutes of Advanced Technology, CAS, China
  • Yanjie Wei, Shenzhen Institutes of Advanced Technology, CAS, China


Presentation Overview:

Circular RNA (circRNA) differs from linear RNA: it has a closed-loop structure and is not easily degraded by RNA exonucleases. In addition, circRNA has more stable expression and strong tissue- and spatiotemporal-specific expression. A series of studies have confirmed that circRNAs play key roles in the pathogenesis of several common childhood malignancies, including neuroblastoma (NB). Currently, we do not yet know which circRNAs are more critical for NB diagnosis and treatment, or which circRNAs affect NB-related metabolic pathways. Here, we analyzed NB-related circRNAs to explore effective targets for the diagnosis and treatment of NB. A total of 76127 circRNAs were identified (by CIRCexplorer) across all samples, and 26330 of them were co-expressed in cancer and normal samples. We found a substantial number of differentially expressed (DE) circRNAs, and these circRNAs interacted with neuroblastoma-related miRNAs. The host genes of the DE circRNAs were also enriched in several significant biological processes, molecular functions, cellular components and important signaling pathways. Our analysis revealed that circRNAs may play an important role in the diagnosis and treatment of neuroblastoma. Further studies are needed to explore the specific role of circRNAs in neuroblastoma.

Virtual: CView: A network based tool for enhanced alignment visualization
COSI: GenCompBio
  • Raquel Linheiro, CIBIO/InBio – Research Centre in Biodiversity and Genetic Resources, Portugal
  • Diana Lobo, CIBIO/InBio – Research Centre in Biodiversity and Genetic Resources, Portugal
  • Stephen Sabatino, CIBIO/InBio – Research Centre in Biodiversity and Genetic Resources, Portugal
  • John Archer, CIBIO/InBio – Research Centre in Biodiversity and Genetic Resources, Portugal


Presentation Overview:

To date, the visualization of alignments has focused on displaying per-site columns of residues along with associated summarizations. The persistence of this tendency in tools designed for viewing mapped reads indicates that such a perspective not only provides a reliable visualization of per-site alterations, but also offers reassurance to end-users in relation to data accessibility. However, the insight gained is limited, something that is especially true when viewing alignments consisting of many sequences representing differing factors, such as location, date and subtype. An alignment viewer has the potential to increase insight through visual enhancement, whilst not delving into the realms of complex sequence analysis. We present CView, a visualizer that expands the per-site representation of residues through the incorporation of a network that is based on the summarization of diversity present across different regions of the alignment. If a node is selected, then the relationship that all sequences passing through that node have to other regions of diversity within the alignment can be observed through paths. CView provides many export features, including variant summarization as well as per-site residue and k-mer frequency matrices, each of which is applicable to varying research areas. It is open source, user-friendly and available at: https://sourceforge.net/projects/cview/.

Virtual: DeepFlu: Forecasting symptomatic influenza A infection based on pre-exposure gene expression
COSI: GenCompBio
  • Anna Zan, National Taiwan Ocean University, Taiwan
  • Zhong-Ru Xie, University of Georgia, United States
  • Yi-Chen Hsu, National Taiwan Ocean University, Taiwan
  • Yu-Hao Chen, National Taiwan Ocean University, Taiwan
  • Tsung-Hsien Lin, National Taiwan Ocean University, Taiwan
  • Yong-Shan Chang, National Taiwan Ocean University, Taiwan
  • Kuan Y. Chang, National Taiwan Ocean University, Taiwan


Presentation Overview:

Motivation
Not everyone gets sick after an exposure to influenza A viruses (IAV). Although KLRD1 is a potential biomarker for influenza susceptibility, it remains unclear whether forecasting symptomatic flu infection based on pre-exposure host gene expression is possible. To examine this hypothesis, we developed DeepFlu, applying a state-of-the-art deep learning approach to human subjects infected with IAV subtype H1N1 or H3N2 viruses, to forecast who would catch the flu prior to an exposure to IAV.

Results
Such a forecast is possible. In leave-one-person-out cross-validation, DeepFlu, based on a deep neural network, outperformed models using a convolutional neural network, random forest, or support vector machine, achieving 70.0% accuracy, 0.787 AUROC, and 0.758 AUPR for H1N1 and 73.8% accuracy, 0.847 AUROC, and 0.901 AUPR for H3N2. In external validation, DeepFlu also reached 71.4% accuracy, 0.700 AUROC, and 0.723 AUPR for H1N1 and 73.5% accuracy, 0.732 AUROC, and 0.749 AUPR for H3N2, surpassing the KLRD1 biomarker. In addition, DeepFlu trained only on pre-exposure data performed best, and mixing H1N1 and H3N2 training data did not necessarily improve prediction. DeepFlu is a prognostic tool that can moderately recognize individuals susceptible to the flu and may help prevent the spread of IAV.
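
The leave-one-person-out cross-validation mentioned in the Results can be illustrated with a minimal sketch. This is not the DeepFlu code: the function name `leave_one_person_out` and the subject labels are hypothetical, and the sketch only shows how all samples from one subject are held out together so that no person contributes to both training and testing.

```python
from collections import defaultdict


def leave_one_person_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject.

    subject_ids[i] is the subject that sample i was taken from.
    """
    by_subject = defaultdict(list)
    for idx, subject in enumerate(subject_ids):
        by_subject[subject].append(idx)

    for held_out, test_idx in by_subject.items():
        # Every sample from every other subject goes into the training set.
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical example: three subjects, some with repeated samples.
subjects = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_person_out(subjects))

for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Grouping by subject rather than by sample matters when a person contributes several expression profiles, since a per-sample split would let the model see the held-out person during training.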

Virtual: Elucidating the extra-cellular regulators of cell fate trajectories with Entrain
COSI: GenCompBio
  • Wunna Kyaw, Garvan Institute of Medical Research, Australia
  • Tri Phan, Garvan Institute of Medical Research, Australia
  • John Murray, UNSW, Australia


Presentation Overview:

Cell fate is commonly studied by profiling the gene expression of single cells to infer developmental trajectories based on expression similarity, RNA velocity, or statistical mechanical approaches. However, current approaches do not recover external signals from the microenvironmental niche that drive a differentiation trajectory. Here, we address this issue by presenting a computational method (Entrain) that unites traditional trajectory inference and RNA velocity with ligand regulatory databases. By fitting trajectories to regulatory databases using a random forest model, Entrain predicts driver ligands responsible for differentiation and decomposes trajectories into environmentally governed components and cell-intrinsic components. Further, Entrain quantifies the degree to which the niche is responsible for the observed trajectory dynamics, improving on existing methods for cell-cell communication inference that rely solely on differential expression of ligand-receptor genes in pre-defined clusters. Finally, we apply Entrain to an orthogonal modality, RNA velocity, to elucidate key environmental signals that are responsible for observed velocities.
We validate our approach on a single-cell bone marrow microenvironmental dataset to recapitulate known environmental drivers of cell fate commitment in haematopoietic and mesenchymal stromal cell lineages. We anticipate this method will help elucidate the driving interactions between developing cells and the governing niches which shape cell fate.

Virtual: Exploring Differences in Tumor Mutation Burden in Acute Lymphoid Leukemia
COSI: GenCompBio
  • Sanjana Sundara Raj Sreenath, Texas Tech University Health Sciences Center - El Paso, United States
  • Johnathon Mohl, The University of Texas at El Paso, United States


Presentation Overview:

Acute lymphoid leukemia (ALL) is a malignancy caused by uncontrolled proliferation of immature B or T lymphocyte precursors. Epidemiological studies have found increased ALL mortality in Hispanic Americans. The purpose of this study is to explore differences in tumor mutation burden (TMB) in ALL between Hispanic and non-Hispanic patient subsets. Variant Call Format (VCF) files containing mutational data were downloaded from The Cancer Genome Atlas (TCGA) database and fed into UTEP’s OncoMiner pipeline. For each patient, a list of genes along with their individual tumor mutation burden was obtained. Differences in TMB subsetted by clinical and demographic data were analyzed. Data processing and statistical analysis were performed using an in-house Python script incorporating pandas, NumPy and SciPy. A total of 534 patients (20.2% Hispanic, 79.6% non-Hispanic) were included in the study. Mean TMB differed significantly between the Hispanic and non-Hispanic groups, with a significant difference also noted between male Hispanic and male non-Hispanic patients. Further, Hispanic vs non-Hispanic B cell specific TMB differences were significant. No significant differences between female TMB values or general B cell TMB values were observed. An in-depth understanding of these group differences could significantly impact ALL prognosis and treatment options.
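
The abstract does not state which statistical test was used to compare mean TMB between groups. As one possible illustration only, the sketch below swaps in a two-sided permutation test on the difference in means, written with the Python standard library rather than the pandas/NumPy/SciPy stack the authors mention. The function name and the TMB values are hypothetical.

```python
import random
from statistics import mean


def permutation_test_mean_diff(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in group means.

    Returns an estimated p-value: the share of random relabelings whose
    absolute mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1

    # Add-one correction so the estimate is never exactly zero.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical per-patient TMB values for two patient subsets.
tmb_group_1 = [0.1, 0.2, 0.3, 0.2, 0.1]
tmb_group_2 = [5.0, 5.1, 5.2, 5.3, 5.1]
p_value = permutation_test_mean_diff(tmb_group_1, tmb_group_2, n_permutations=2000)
print(f"estimated p-value: {p_value:.4f}")
```

A permutation test makes no normality assumption, which may be convenient for skewed mutation counts, but whether it matches the authors' analysis is not known from the abstract.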

Virtual: FusionPDB: A Knowledgebase for Human Fusion Proteins
COSI: GenCompBio
  • Himansu Kumar, University of Texas Health Science and Center Houston, Texas, USA, United States
  • Lin-Ya Tang, University of Texas Health Science and Center Houston, Texas, USA, United States
  • Pora Kim, University of Texas Health Science and Center Houston, Texas, USA, United States


Presentation Overview:

Aberrant regulation of pathogenic functions caused by the formation of fusion proteins has been targeted in cancer therapeutics (e.g., kinase inhibitors). However, many fusion proteins have yet to be targeted therapeutically. To fill this gap, we developed a new computational pipeline and a resource, named FusionPDB. Previously, we reported ~43K human fusion protein sequences, translated from 126K fusion transcripts, transcribed from 16K in-frame fusion genes in FusionGDB 2.0. In this study, we predicted the fusion amino acid sequences of 1,006 curated human fusion genes and the 3D structures of these fusion proteins. Then, we investigated the active sites of individual fusion proteins and performed virtual screening between the active sites of 1,267 fusion proteins and 1,615 FDA-approved drugs. We demonstrated the feasibility of our approach for several well-known fusion proteins and their drugs through molecular dynamics simulation. We also reported the toxicity and pharmacokinetic (ADME) properties of the identified small molecules. FusionPDB is the only resource that provides comprehensive knowledge on human fusion proteins, and it will be regularly updated to cover ~43K human fusion proteins in the future. It will be routinely used by diverse cancer and drug research communities.

Virtual: GaWRDenMap: A spatially-informed framework to quantify heterogeneity in inter-cellular interactions
COSI: GenCompBio
  • Santhoshi Krishnan, Rice University, United States
  • Shariq Mohammed, Boston University, United States
  • Timothy Frankel, University of Michigan, United States
  • Arvind Rao, University of Michigan, United States


Presentation Overview:

In recent years, spatially informed disease modelling has been used to capture cellular heterogeneity in the tumor microenvironment. Tumor heterogeneity quantification is important for pancreatic malignancies, where the simultaneous occurrence of pancreatic ductal adenocarcinoma (PDAC) and pre-cancerous lesions can make diagnostic discrimination a challenging task. We introduce a framework, called GaWRDenMap, that combines the geospatial concept of geographically weighted regression (GWR) with a density function-based classification model. In this study, we applied this framework to an internal cohort of multiplexed immunofluorescence images from 228 patients from 6 different pancreatic disease cohorts, namely chronic pancreatitis (CP), PDAC, intraductal papillary mucinous neoplasm (IPMN), mucinous cystic neoplasm (MCN), pancreatic intraepithelial neoplasia (PanIN) and IPMN-associated PDAC. Epithelial and immune cells were our covariates of interest for this proof-of-concept study. The density features obtained from the GWR model were used as input features for the pairwise disease classification models. We find that the model can best distinguish between CP and PDAC (AUC = 0.875), and PDAC and IPMN (AUC = 0.753), with appreciable classification performance across other pairwise comparisons. We conclude that the output of our framework provides a set of spatially informed interaction features, which can be representative of the overall immune-epithelial interaction across patients with different diseases.

Virtual: Genome-scale Protein Interactome: Unraveling Novel Host Targets in Wheat-Stem Rust Pathosystem
COSI: GenCompBio
  • Raghav Kataria, Utah State University, United States
  • Rakesh Kaundal, Utah State University, United States


Presentation Overview:

Elucidation of plant-microbe interactions has accelerated the understanding of disease infection mechanisms and of how plant immune signaling molecules behave under different stress conditions. We employed two computational approaches, homology-based interolog and domain-based, to predict protein-protein interactions (PPIs) between Triticum aestivum and Puccinia graminis (Pgt 21-0 and Pgt Ug99), which causes stem rust. The T. aestivum-Pgt Ug99 and T. aestivum-Pgt 21-0 interactomes consisted of ~56M and ~90M putative PPIs, respectively. We identified 34 Pgt Ug99 and 115 Pgt 21-0 potential effectors. The proteins were enriched in significant GO terms (oxidoreductase activity, regulation of response to stimulus, chloroplast envelope) and KEGG pathways (MAPK signaling pathway, plant-pathogen interaction pathway) that are involved in generating an immune response against pathogen attack. The highly connected host hub proteins were Ser/Thr and cyclin-dependent kinases, which respond actively under various plant stress conditions. Subcellular localization prediction of the proteins revealed an appropriate site of plant-microbe protein-protein interactions. We also identified 5,577 stress-related transcription factors and novel, disease-resistant host targets. This is the first study to report the use of advanced computational approaches to decipher genome-scale host-pathogen PPIs, thus enabling researchers to better understand the pathogen infection mechanisms and develop disease-resistant lines.

Virtual: Identifying cell cluster-specific gene-gene interactions for single cell transcriptomes using association rule mining
COSI: GenCompBio
  • Dibyendu B. Seal, University of Calcutta, India
  • Vivek Das, Novo Nordisk A/S, India
  • Rajat K. De, Indian Statistical Institute Kolkata, India


Presentation Overview:

Single cell RNA-sequencing (scRNA-seq) technologies have allowed researchers to investigate transcriptional regulation at a cellular resolution and derive biologically significant inferences. One such analysis often involves extracting statistically significant cell clusters that enable cell-type identification, based on the presence or absence of canonical markers. However, cells with similar gene expression profiles may sometimes represent variable transcriptional states. Identifying cell-type specific markers is thus not sufficient to understand the underlying molecular activity within a cell cluster. Rather, key regulators within cell clusters should be identified that can better describe the underlying transcriptional variability and gene-gene interactions. In order to assess the cells' functionality, genes driving or being driven by the markers need to be analysed against reference databases. In this work, we propose an Association Rule Mining (ARM)-based framework that can identify major gene-gene interactions within a cell cluster in scRNA-seq data. These interaction networks have helped us identify key regulators, some of which have been found to be relevant canonical markers produced by benchmark methods. Further analysis of the sub-networks formed by hub or marker genes along with their neighbours, via Over Representation Analysis-based pathway enrichment, has revealed interesting functional characteristics that could be important for downstream biological interpretations.
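
To make the association rule mining idea concrete, here is a minimal sketch of pairwise rule mining over binarized expression. It is not the authors' framework. It assumes each cell in a cluster has already been reduced to the set of genes considered "expressed", and the gene names, thresholds and function name are hypothetical.

```python
from itertools import permutations


def mine_pairwise_rules(cells, min_support=0.5, min_confidence=0.8):
    """Return rules (antecedent, consequent, support, confidence) between gene pairs.

    cells: list of sets, one per cell, holding the genes treated as expressed.
    support(A, B)    = fraction of cells expressing both A and B
    confidence(A->B) = support(A, B) / support(A)
    """
    n_cells = len(cells)
    genes = sorted(set().union(*cells))

    def support(items):
        return sum(1 for cell in cells if items <= cell) / n_cells

    rules = []
    for a, b in permutations(genes, 2):
        s_ab = support({a, b})
        if s_ab < min_support:
            continue
        confidence = s_ab / support({a})
        if confidence >= min_confidence:
            rules.append((a, b, s_ab, confidence))
    return rules


# Hypothetical cluster of four cells with binarized expression.
cluster_cells = [
    {"G1", "G2"},
    {"G1", "G2", "G3"},
    {"G1", "G2"},
    {"G3"},
]
rules = mine_pairwise_rules(cluster_cells)
for antecedent, consequent, s, c in rules:
    print(f"{antecedent} -> {consequent}: support={s:.2f}, confidence={c:.2f}")
```

Each retained rule can be read as a directed edge in a gene-gene interaction network for the cluster. A practical implementation would use an Apriori- or FP-growth-style search rather than this exhaustive pairwise loop.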

Virtual: Inherited rare deleterious variant load alters cancer risk, age of onset and tumor immune microenvironment
COSI: GenCompBio
  • Myvizhi Esai Selvan, Icahn School of Medicine at Mount Sinai, United States
  • Kenan Onel, Icahn School of Medicine at Mount Sinai, United States
  • Sacha Gnjatic, Icahn School of Medicine at Mount Sinai, United States
  • Robert J. Klein, Icahn School of Medicine at Mount Sinai, United States
  • Zeynep H. Gümüş, Icahn School of Medicine at Mount Sinai, United States


Presentation Overview:

Recent studies show that rare, deleterious variants (RDVs) in certain genes are critical determinants of heritable cancer risk due to their high penetrance. Better understanding the role of RDVs in cancer will contribute to improved precision prevention, screening and treatment.

Towards this goal, we performed the largest-to-date jointly processed germline multi-cancer case-control association study from existing whole-exome sequencing data of 20,789 participants, split into discovery and validation cohorts. Specifically, we focused on RDVs annotated for pathogenicity through ClinVar using a rigorous analysis framework. For increased statistical power, we pursued a collapsing approach, and examined the cumulative effects of RDVs in functionally related gene-sets.

Our results confirmed and extended known associations between cancer risk and germline RDVs in genes involved in DNA repair, cancer predisposition, and somatic cancer drivers. Furthermore, we found that carrying multiple RDVs (personal RDV load) in these gene-sets is associated with increased cancer risk, earlier age of diagnosis, increased M1 macrophages in tumors and, in specific cancers, increased tumor mutational burden.

These findings will increase our understanding of RDVs and can be used towards identifying high-risk individuals, who can then benefit from increased surveillance and treatments that exploit their tumor characteristics, improving prognosis.

Virtual: Integrated, consensus-based approach to characterize alternative splicing in Peripheral Arterial Disease (PAD)
COSI: GenCompBio
  • Julián Candia, National Institute on Aging, National Institutes of Health (Baltimore, MD, USA), United States
  • Ceereena Ubaida-Mohien, National Institute on Aging, National Institutes of Health (Baltimore, MD, USA), United States
  • Nirad Banskota, National Institute on Aging, National Institutes of Health (Baltimore, MD, USA), United States
  • Supriyo De, National Institute on Aging, National Institutes of Health (Baltimore, MD, USA), United States
  • Mary McDermott, Feinberg School of Medicine, Northwestern University (Chicago, IL, USA), United States
  • Luigi Ferrucci, National Institute on Aging, National Institutes of Health (Baltimore, MD, USA), United States


Presentation Overview:

Background: Alternative splicing endows organisms with the capacity to adapt their gene and protein expression profiles in response to changing needs under a wide range of conditions, playing a key role in aging and disease. The goals of this work are two-fold, namely (1) to reveal alternative splicing mechanisms of muscle damage and resilience in PAD, and (2) to establish a robust computational pipeline to characterize alternative splicing events from RNA-Seq data in a broader class of studies of normal aging and aging-related diseases.

Method: Existing methods are classified as isoform-based (which reconstruct full-length transcripts, e.g. RSEM, Kallisto, Salmon, DiffSplice), exon-based (biologically less relevant but computationally more reliable, e.g. DEX-Seq, edgeR, JunctionSeq, limma) or event-based (which quantify different types of splicing event, e.g. rMATS, SUPPA, MAJIQ, dSpliceType). Due to the prevalence of false positive bias, we implemented a robust and conservative computational pipeline, which required consensus across all three types of approach.

Results and Significance: We found 275 differentially utilized exons, corresponding to 201 transcripts and 150 genes, validated previous observations, and identified significantly enriched pathways. By integrating complementary approaches, we developed a computational pipeline to characterize alternative splicing in PAD. This framework is broadly applicable to other aging-related studies.
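
The consensus requirement described in the Method paragraph can be sketched as a set operation. The abstract only says that consensus across all three types of approach was required. How calls from tools within the same type were combined is not stated, so the union used below is an assumption, and the tool labels and gene names are hypothetical.

```python
def consensus_genes(calls_by_family):
    """Return genes supported by every family of splicing-analysis methods.

    calls_by_family maps a family name (e.g. "isoform", "exon", "event")
    to a dict of {tool_name: set_of_significant_genes}.
    Within a family, tool calls are pooled with a union (an assumption);
    across families, an intersection enforces the consensus.
    """
    family_sets = []
    for tool_calls in calls_by_family.values():
        pooled = set()
        for genes in tool_calls.values():
            pooled |= set(genes)
        family_sets.append(pooled)

    if not family_sets:
        return set()
    return set.intersection(*family_sets)


# Hypothetical significant-gene calls from placeholder tools.
calls = {
    "isoform": {"tool_a": {"GENE1", "GENE2"}, "tool_b": {"GENE3"}},
    "exon": {"tool_c": {"GENE1", "GENE3", "GENE4"}},
    "event": {"tool_d": {"GENE1", "GENE3"}, "tool_e": {"GENE5"}},
}
consensus = consensus_genes(calls)
print(sorted(consensus))
```

Requiring agreement across families trades sensitivity for specificity, which matches the conservative stance against false positives described in the Method paragraph.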

Virtual: Interpretation and analysis of cellular morphologies from single-cell resolution sequential fluorescent in-situ hybridization data
COSI: GenCompBio
  • Qian Zhu, Dana Farber Cancer Institute, United States
  • Guo-Cheng Yuan, Icahn School of Medicine at Mount Sinai, United States


Presentation Overview:

Cells in intact tissues consist of a diverse set of cell types and often reside in distinct spatial compartments. Recent image-based multiplexed technologies combine histology staining with RNA sequential hybridization, enabling simultaneous measurement of gene expression and morphology in single cells, thereby providing a great opportunity to systematically investigate the relationship between these two commonly used cell-state classification approaches. We present a study analyzing a transcriptome-scale super-resolved SeqFISH+ dataset as well as two other SeqFISH datasets of mouse cortical regions. We develop a computational approach to systematically characterize the relationship between cellular morphology and gene expression, which involves first acquiring accurate cell morphology information from Nissl and DAPI images. We then apply a pretrained convolutional neural network to extract morphological features and provide an interpretation of the feature space in terms of predicting gene expression states. Our findings reveal that genes whose expression is associated with morphology not only correspond to markers of cell types and spatial domains, but also include genes that participate in extracellular matrix-receptor signaling and signal transduction pathways. Our analysis provides novel insights into the relationship between morphology- and transcriptome-defined cell states.

Virtual: Leveraging the Google Cloud and BigQuery for Cancer-data analysis on the ISB Cancer Gateway in the Cloud (ISB-CGC)
COSI: GenCompBio
  • Fabian Seidl, ISB-CGC, GDIT, United States
  • Poojitha Gundluru, ISB-CGC, GDIT, United States
  • Boris Aguilar, ISB-CGC, ISB, United States
  • David Pot, ISB-CGC, GDIT, United States
  • William Longabaugh, ISB-CGC, ISB, United States


Presentation Overview:

Rapid growth of cancer data in recent decades has made data discovery and wrangling difficult for the average cancer research lab. Our mission at the ISB Cancer Gateway in the Cloud (ISB-CGC), part of the NCI’s Cancer Research Data Commons ecosystem, is to democratise access to large cancer datasets. Funded by the NCI, we have performed ETL processes on data from GDC and PDC projects such as TCGA, TARGET, and CPTAC. We generated hundreds of BigQuery tables containing data such as mutations, gene expression, and protein abundance, which enable data analysis in the cloud via SQL. BigQuery analyses are inexpensive and rapid even when scaled to petabyte-sized inputs; for example, we ran 6.6 billion correlations in 2.5 hours at a total cost of only $1.16. These data can also be accessed cheaply from Google Cloud VMs, where researchers can develop analysis pipelines in Python, R, and workflow languages such as CWL. We present two recent collaborations: in one, we generated tools that leverage a host of genomic data to improve the discoverability of synthetic lethal gene pairs in cancers; in the other, we helped characterise the disparities observed between African American and White breast cancer patients.
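
As a back-of-the-envelope reading of the figures quoted above (6.6 billion correlations, 2.5 hours, $1.16), the short calculation below derives the implied average throughput and unit cost. It assumes the work was spread evenly over the run, which the abstract does not state.

```python
# Figures quoted in the abstract.
correlations = 6.6e9
hours = 2.5
cost_usd = 1.16

# Average throughput, assuming an even rate over the whole run.
per_second = correlations / (hours * 3600)

# Cost per billion correlations.
usd_per_billion = cost_usd / (correlations / 1e9)

print(f"{per_second:,.0f} correlations per second")     # roughly 733,000
print(f"${usd_per_billion:.2f} per billion correlations")  # roughly $0.18
```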

Virtual: One Cell At A Time: a unified framework to integrate and analyze single-cell RNA-seq data
COSI: GenCompBio
  • Lin Zhang, University Health Network, Canada
  • Chloe Wang, University Health Network, Canada
  • Bo Wang, University Health Network, Canada


Presentation Overview:

The surge of single-cell RNA sequencing (scRNA-seq) technologies has given rise to an abundance of large scRNA-seq datasets at the scale of hundreds of thousands of single cells. Integrative analysis of large-scale scRNA-seq datasets can aggregate complementary biological information from different datasets and has the potential to reveal de novo cell types. However, most existing methods fail to integrate multiple large-scale scRNA-seq datasets in a computationally and memory-efficient way. Our recent work, OCAT (One Cell At A Time), is a machine learning method that sparsely encodes single-cell gene expression to integrate data from heterogeneous sources without highly variable gene selection or explicit batch effect correction. We have demonstrated that OCAT efficiently integrates multiple scRNA-seq datasets and achieves state-of-the-art performance in cell type clustering, especially in challenging scenarios with non-overlapping cell types. In addition, OCAT can facilitate a variety of downstream analyses, such as differential gene analysis, trajectory inference, pseudotime inference and cell type inference. OCAT is a unifying tool to simplify and expedite the analysis of large-scale scRNA-seq data from heterogeneous sources.

Virtual: PhycoMine: A omics warehouse for Microalgae
COSI: GenCompBio
  • Rodrigo R. D. Goitia, University of São Paulo, Brazil
  • Diego Mauricio Riaño-Pachón, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil
  • Alexandre Victor Fassio, University of São Paulo, Brazil
  • Flavia Vischi Winck, Center for Nuclear Energy in Agriculture, University of São Paulo, Brazil


Presentation Overview:

PhycoMine is a data warehouse system built on top of InterMine to ease the analysis of complex and integrated data from microalgae species. PhycoMine has an extended database model and a series of tools and widgets created to facilitate simultaneous data mining of different datasets. Among the widgets implemented in PhycoMine, there are options for mining chromosome distribution and gene expression data (proteomics and transcriptomics), and for evaluating enrichment of Gene Ontology terms, KEGG pathways, publications, EggNOG, transcription factors and transcriptional regulators, as well as phenotypic data. So far, we have included 200 RNA-seq datasets from Chlamydomonas reinhardtii. With this platform, users can perform integrated data mining based on a list of genes/proteins of interest, accessing data from different sources and visualizing them, with the option of exporting the data into table formats. The PhycoMine platform is freely available at https://PhycoMine.iq.usp.br.

Virtual: Scalable in-memory paradigm for genomics data processing
COSI: GenCompBio
  • Vadim Elisseev, IBM, United Kingdom
  • Laura-Jayne Gardiner, IBM, United Kingdom
  • Ritesh Krishna, IBM, United Kingdom


Presentation Overview: Show

Disk storage and access incur huge latency in the processing of genomics datasets. To accelerate downstream data processing, the data needs to be closer to the processor and available in fast-access memory devices. Historically, it was difficult to achieve this at larger scale due to cost and architectural constraints around dynamic random-access memory (DRAM) technology. There have been rapid improvements in memory technologies and computer architectures that allow cost-effective solutions for processing large amounts of data in significantly less time. The in-memory paradigm takes advantage of these new architectural designs, where a large memory pool can be created within a cluster or a cloud by means of distributed in-memory databases or directly by aggregating individual nodes together into a shared memory system. An in-memory paradigm can minimize the latency associated with traditional HPC and cloud-based bioinformatics workflows, where tools are stitched together sequentially, the output from one tool feeds the next tool as input, and a significant amount of secondary disk-based I/O is performed at each stage. We designed this study to investigate whether in-memory technologies, such as scalable in-memory databases, can be used to accelerate genomics data processing.

Virtual: Small molecule screening on patient and synthetic leukemias reveals novel therapeutics
COSI: GenCompBio
  • Safia Safa-Tahar-Henni, Institute for Research in Immunology and Cancer of the Université de Montréal, Canada
  • Karla Paez Martinez, Institute for Research in Immunology and Cancer of the Université de Montréal, Canada
  • Brian Wilhelm, Institute for Research in Immunology and Cancer of the Université de Montréal, Canada


Presentation Overview: Show

Research on acute myeloid leukemia (AML) has identified some recurrent genetic drivers, such as KMT2A gene rearrangements, which are present in ~65% of infant AML. Translocations of the KMT2A gene have been shown to involve more than 120 different partner genes, and the specific partner gene involved can also dramatically affect patient survival. The molecular biology of KMT2A-rearranged AML remains incompletely characterized and is complicated by the genetic heterogeneity seen in patients. The objective of this project is to identify novel potential therapeutics and to understand the development of the disease through a large-scale small-molecule screen of patient and synthetic leukemias. More than 11000 compounds were screened against leukemia samples in a single-dose primary screen in order to identify compounds with significant anti-leukemic activity (compounds that kill KMT2A-MLL3 AML samples but not normal cord blood CD34+ control cells). After validating our initial hits by dose-response analysis, we identified Compound 1, an inhibitor of nucleotide synthesis, which shows specific activity against AML with the KMT2A-AF9 fusion.

Virtual: Systematic discovery of regulatory motifs associated with the insulator function of human enhancer-promoter interactions
COSI: GenCompBio
  • Naoki Osato, Waseda University, Japan
  • Michiaki Hamada, Waseda University, Japan


Presentation Overview: Show

Chromatin interactions are essential in enhancer-promoter interactions (EPIs) and transcriptional regulation. CTCF and cohesin proteins are located at chromatin interaction anchors. However, there is still no overall understanding of the proteins associated with chromatin interactions and insulator functions. Here, we describe a systematic and comprehensive deep-learning-based approach for discovering DNA-binding motifs of transcription factors (TFs) associated with insulator function, EPIs, and gene expression. This analysis identified 98 directional and non-directional motifs that significantly affected the expression level of putative transcriptional target genes in human foreskin fibroblast cells, and included the following known TFs that are associated with insulator function or interact with an insulator TF: CTCF, cohesin (RAD21 and SMC3), BATF, BCL6, FOS, FOXA3, HNF4A, JUN, MAZ, MECP2, MYB, MYOD1, PAX5, PRDM9, SIN3A, SMAD2, SMAD3, SPI1, TRIM28, USF1, VDR, and ZNF143. Most of the known TFs are associated with CTCF, but MAZ is reported to have insulator function independently. These findings and methods contribute to revealing novel functions of TFs and gene regulation.

Virtual: The effect of genomic 3D structure on CRISPR cleavage efficiency
COSI: GenCompBio
  • Shaked Bergman, Tel Aviv University, Israel
  • Tamir Tuller, Tel Aviv University, Israel


Presentation Overview: Show

CRISPR is a gene editing technology which enables precise in-vivo genome editing. But its potential is hampered by its relatively low specificity and sensitivity. Improving CRISPR’s on-target and off-target effects requires a better understanding of its mechanism and determinants. Here we demonstrate, for the first time, the chromosomal 3D spatial structure’s effect on CRISPR’s cleavage efficiency, and its predictive capabilities.

We used high-resolution Hi-C data to estimate the 3D distance between different regions in the human genome and used these spatial properties to generate 3D-based features, characterizing each region’s density. We evaluate these features based on empirical, in-vivo CRISPR efficiency data and compare them to 5 state-of-the-art CRISPR efficiency models. The 3D features improved the models’ combined R² by 24.2%, and their correlation to the empirical CRISPR efficiency was higher than 3 of the models’.

The features indicated a uniform relation between the 3D properties of the target site and its CRISPR efficiency: sites with lower spatial density demonstrated higher efficiency. Understanding how CRISPR is affected by the 3D DNA structure provides insight into CRISPR’s mechanism in general and improves our ability to correctly predict CRISPR’s cleavage as well as design gRNAs for therapeutic and scientific use.

Virtual: The in-silico and in-vitro characterization of epigenetic drugs (BET Protein Inhibitors and related analogs) on a colorectal cell line (HCT116)
COSI: GenCompBio
  • Grace Zang, Aspiring Scholars Directed Research Program, United States
  • Prabhav Pragash, Aspiring Scholars Directed Research Program, United States
  • Aditi Deshpande, Aspiring Scholars Directed Research Program, United States
  • Sanjana Selvaraj, Aspiring Scholars Directed Research Program, United States
  • Sowkya Namburu, Aspiring Scholars Directed Research Program, United States
  • Clinton Cunha, Aspiring Scholars Directed Research Program, United States


Presentation Overview: Show

Members of the bromodomain and extra-terminal domain (BET) family can lead to the overexpression of oncogenes (Shorstova et al., 2021). BET inhibitors (BETi) moderately reduce colorectal cancer (CRC) cell proliferation and MYC expression when used in monotherapy (Ma et al., 2016). This study aims to identify potential BETi for colorectal cancer (in silico) and to determine the effects of these drugs on CRC cells (in vitro). JQ1, a BET protein inhibitor that suppresses tumor progression, will be the control (Wen et al., 2020). PubChem datasets will be converted into a numerical format (chemical fingerprints) and used with unsupervised learning algorithms (Sydow et al., 2019) to assess the molecules’ similarity to each other and to JQ1. After clustering, relevant clustered drugs will be docked with AutoDock Vina (Trott et al., 2010) and Rxdock (Ruiz-Carmona et al., 2014). The drugs found to have the greatest binding affinity to BRD4 will then be computationally tested on colon cancer cells using DeepCDR (Liu et al., 2020). These drugs will then be synthesized and tested on colorectal cancer cells with procedures including MTT (Freimoser et al., 1999) and qPCR (Mullis, 1985) to measure cell viability and gene expression levels.
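The fingerprint comparison step described above can be illustrated with a minimal sketch. It assumes fingerprints are represented as sets of on-bit indices and uses Tanimoto similarity, a common choice for chemical fingerprints; the abstract does not state which similarity measure or clustering algorithm the authors use. The compound names and bit positions are hypothetical toy values, not PubChem data.

```python
def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints given as sets of on-bit indices."""
    if not a and not b:
        return 1.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_by_similarity(reference, library):
    """Return (name, similarity) pairs sorted from most to least similar to the reference."""
    scores = {name: tanimoto(reference, fp) for name, fp in library.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy fingerprints (hypothetical bit positions; assumption for illustration only).
reference = {1, 4, 7, 9, 12}  # stands in for the JQ1 control
candidates = {
    "compound_A": {1, 4, 7, 9, 13},
    "compound_B": {2, 3, 5},
    "compound_C": {1, 4, 7, 9, 12},
}

ranking = rank_by_similarity(reference, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In a real workflow the similarity matrix would feed a clustering step, and only compounds clustering near the reference would proceed to docking.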

Virtual: The in-silico characterization of epigenetic drugs (for epigenetic targets such as DNA Methyltransferase) on a colorectal cell line (HCT116)
COSI: GenCompBio
  • Aksithi Eswaran, Aspiring Scholars Directed Research Program, United States
  • Ashley Lin, Aspiring Scholars Directed Research Program, United States
  • Shikha Kathrani, Aspiring Scholars Directed Research Program, United States
  • Clinton Cunha, Aspiring Scholars Directed Research Program, United States


Presentation Overview: Show

In cancerous states, DNA methyltransferase 1 (DNMT1) can silence tumor suppressor genes (Hu, Liu, Zeng, et al., 2021). DNMT1 inhibitors reverse this process and reactivate the genes (Hu, Liu, Zeng, et al., 2021). We are taking a computational approach to finding novel treatments for colon cancer, combining unsupervised learning (Rhys, 2020) and molecular docking (Meng, et al., 2011). With unsupervised learning (McInnes, et al., 2018; Corsello, et al., 2020), we will use knowledge of pre-existing inhibitors to cluster compounds from a ChEMBL (Mendez, et al., 2018) dataset representing the chemical space. Then, using Avogadro (Hanwell, et al., 2012), Orca (Neese, et al., 2011), and AutoDock Vina (Trott, Olson, 2010), we will batch dock those similar compounds. With the binding affinities that AutoDock Vina outputs, we can narrow down the list of possible drugs that will be effective against colon cancer. Finally, we will use DeepCDR to determine, in silico, different drugs’ efficiency on colon cancer cell lines based on the transcriptomic, genomic and epigenomic data of cell lines along with previous drug-cell line pair patterns (Liu et al., 2020). We hope to find novel compounds using recognized software tools.

Virtual: The spring-mass model and other reductionist models of bipedal locomotion on inclines
COSI: GenCompBio
  • Alessandro Maria Selvitella, Purdue University Fort Wayne and eScience Institute, University of Washington, United States
  • Kathleen Lois Foster, Ball State University, United States


Presentation Overview: Show

The spring-mass model has been extensively investigated for locomotion over horizontal surfaces, but largely neglected on other ecologically relevant surfaces, including inclines. In this work, we extend the spring-mass model to inclined surfaces. We derive an approximate solution of the system, assuming a small angular sweep of the limb and a small spring compression during stance. We show that this approximation is very accurate, especially for small inclinations, and discuss locomotor stability questions of the approximate solutions. We perform a sensitivity analysis using parameters relevant to the locomotion of bipedal animals (quail, pheasant, guinea fowl, turkey, ostrich, and humans). We compare the two-dimensional spring-mass model on inclines with the one-dimensional spring-mass model (limit of no horizontal velocity) and with the inverted-pendulum model (limit of high stiffness-to-mass ratio) on inclines. We include comparisons between no-gravity approximations of these models. The insights we have gleaned through all these comparisons and the ability of our approximation to replicate some of the kinematic changes observed in animals moving on different inclines (e.g. reduction in vertical oscillation of the center of mass and decreased stride length) underline the valuable and reasonable contributions that very simple, reductionist models, like the spring-mass model, can provide.
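For readers unfamiliar with the model, the stance phase of a planar spring-mass system on an incline can be integrated numerically as in the sketch below. This is not the authors' approximate analytical solution: it is a plain velocity-Verlet integration under assumed, human-like parameter values (mass, stiffness, leg length, touchdown angle, and touchdown velocity are all illustrative), with coordinates aligned to the slope and the foot pinned at the origin.

```python
import math


def simulate_stance(theta, m=80.0, k=20000.0, l0=1.0, g=9.81,
                    alpha=0.3, vx=3.0, vy=-0.5, dt=1e-4, max_steps=200_000):
    """Integrate one stance phase of a planar spring-mass model on an incline.

    x runs along the slope, y is normal to it; theta is the incline angle (rad).
    alpha is the touchdown leg angle measured from the slope normal (assumed).
    """
    gx, gy = -g * math.sin(theta), -g * math.cos(theta)  # gravity in slope frame
    x, y = -l0 * math.sin(alpha), l0 * math.cos(alpha)   # touchdown position

    def accel(px, py):
        r = math.hypot(px, py)
        f = k * (l0 - r) / (m * r)  # spring acceleration per unit position
        return f * px + gx, f * py + gy

    def energy(px, py, ux, uy):
        r = math.hypot(px, py)
        kinetic = 0.5 * m * (ux * ux + uy * uy)
        elastic = 0.5 * k * (l0 - r) ** 2
        potential = -m * (gx * px + gy * py)
        return kinetic + elastic + potential

    e0 = energy(x, y, vx, vy)
    ax, ay = accel(x, y)
    min_r, steps = l0, 0
    for steps in range(1, max_steps + 1):
        # Velocity-Verlet step: half kick, drift, half kick.
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        x += dt * vx
        y += dt * vy
        ax, ay = accel(x, y)
        vx += 0.5 * dt * ax
        vy += 0.5 * dt * ay
        r = math.hypot(x, y)
        min_r = min(min_r, r)
        if r >= l0:  # spring back at rest length: takeoff
            break
    e1 = energy(x, y, vx, vy)
    return {
        "duration": steps * dt,
        "max_compression": l0 - min_r,
        "energy_drift": abs(e1 - e0) / abs(e0),
        "takeoff": (x, y, vx, vy),
    }


flat = simulate_stance(0.0)
uphill = simulate_stance(math.radians(10))
print(flat["duration"], uphill["duration"])
```

Because the model is conservative, the relative energy drift gives a quick check that the time step is small enough.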

Virtual: Transfer of Anolis locomotor behaviour across environments and species
COSI: GenCompBio
  • Kathleen Lois Foster, Ball State University, United States
  • Alessandro Maria Selvitella, Purdue University Fort Wayne and eScience Institute, University of Washington, United States


Presentation Overview: Show

Anolis lizards are remarkable in the apparent ease with which they conquer heterogeneous environments and maintain stable locomotion on widely disparate surfaces. Here, we analyze the limb movements of two trunk-crown Anolis ecomorphs, A. carolinensis and A. evermanni, running on 6 different surfaces (3 inclinations x 2 perch diameters), from the perspective of Transfer Learning. We show that the strategies employed to improve locomotor stability on narrow perches are transferred across environments with different inclines. Further, behaviours used on vertical inclines are shared across perch diameters, whereas the relationship between horizontal and intermediate inclines changes on different perch diameters, leading to lower transfer learning accuracy of shallow inclines across perch diameters. Our results suggest that subtle differences exist in how A. carolinensis and A. evermanni adjust their behaviours in typical trunk-crown environments and that they may have converged on similar strategies for modulating forelimb behaviour on vertical surfaces and hind limb behaviour on shallow surfaces. This work is an example of how modern statistical methodology can provide interesting perspectives on new biological questions, such as on the role and nuances of behavioural plasticity and the key behaviours that help shape the versatility and rapid evolution of Anolis lizards.

Virtual: Uncovering the molecular underpinnings of high- and low-quality partnership between Medicago truncatula and Ensifer meliloti
COSI: GenCompBio
  • Muhammad Rizwan Riaz, University of Illinois at Urbana-Champaign, United States
  • Hanna Lindgreen, University of Illinois at Urbana-Champaign, United States
  • Rebecca Batstone, University of Illinois at Urbana-Champaign, United States
  • Ivan Marquez, University of Illinois at Urbana-Champaign, United States
  • Crissy Gallick, University of Illinois at Urbana-Champaign, United States
  • Katy Heath, University of Illinois at Urbana-Champaign, United States
  • Amy Marshall-Colon, University of Illinois at Urbana-Champaign, United States


Presentation Overview: Show

In mutualism, species share resources to provide benefits for both partners. Microbial symbiotic mutualisms are ubiquitous in natural and managed systems and play important roles in human health and plant productivity; these ecosystem services depend on the exchange of fitness benefits between the mutualist partners. In this study, transcriptomic data from nodule tissue of Medicago truncatula infected with 20 different strains of Ensifer meliloti were analyzed to explore transcriptomic profiles using Differential Gene Expression analysis, WGCNA and Causality Analysis. Computational exploratory analysis of the nodule transcriptome revealed specific gene modules and pathways that vary between high- and low-quality symbiotic strains, including genes located on pSymA, and that correlate with plant fitness. Furthermore, we performed causality analysis to link genetic variation identified from GWAS on a large pool of Ensifer strains to genes present in significant WGCNA modules and to shoot biomass. We found significant causal relationships between several SNPs, including two found on adenylate cyclase and four on speC, that influence shoot biomass via ten and forty-six bacterial genes, respectively. Experimental validation of statistically significant causal links is underway for high- and low-quality partner strains. This research highlights the importance of integrative analysis for studying the symbiotic partnership.

Virtual: Untargeted Transcriptomic Analysis of the Effects of Centella asiatica in Cortical Neurons
COSI: GenCompBio
  • Steven R Chamberlin, Oregon Health and Science University, United States
  • Shannon Mcweeney, OHSU, United States
  • Jonathon Zweig, OHSU, United States
  • Dan Bottomly, OHSU, United States
  • Cody Neff, OHSU, United States
  • Claudia Maier, OHSU, United States
  • Amala Soumyanath, OHSU, United States
  • Nora Gray, OHSU, United States


Presentation Overview: Show

The water extract of the Ayurvedic plant Centella asiatica (CAW) can increase synaptic plasticity and cognitive function, although the exact mechanism by which this occurs is not fully understood. To further explore potential mechanisms of action, we investigated transcriptomic changes that occur in primary cortical neurons treated with CAW or combinations of its constituent compounds.
Mouse primary cortical neurons were treated with CAW or 3 different groups of constituent compounds: triterpenes (TT), caffeoylquinic acids (CQA), combined TT and CQA (TT+CQA) at concentrations equivalent to their presence in CAW. Samples were analyzed by RNA-seq.
Differential gene expression analyses comparing each of the four treatment groups to control show 2667 genes that were significantly altered by CAW, 198 for CQA, 1760 for TT, and 981 for combined TT+CQA. All differences were significant after FDR adjustment. Preliminary pathway and network context evaluation has shown significant enrichment with CAW up-regulated genes in pathways related to collagen formation.
While previous work has focused on TT and CQAs as active CAW compounds, the current findings indicate the presence of other active compounds in CAW, as well as potential interactions between the TT and CQA compounds.

G-001: CosTaL: An accurate and scalable graph-based clustering algorithm for high-dimensional single-cell data analysis
COSI: GenCompBio
  • Yijia Li, University of Minnesota, United States
  • Edgar Arriaga, University of Minnesota, United States
  • David Anastasiu, Santa Clara University, United States


Presentation Overview: Show

To analyze the large multidimensional single-cell datasets that are increasingly common, we report CosTaL (Cosine-based Tanimoto similarity-pruned graph for community detection by Leiden algorithm), a strategy for clustering. Similar to predecessors such as PhenoGraph and PARC, CosTaL transforms cells with high-dimensional features from omics data into a weighted k-nearest-neighbor (kNN) graph. The cells become the vertices of the graph, while the close relatedness between similar cells is kept, represented by the weight of the edges between vertices. Specifically, CosTaL builds an exact kNN graph using cosine similarity and uses the Tanimoto coefficient as the pruning strategy to re-weight the edges of the graph and improve the accuracy of clustering. We demonstrate that CosTaL generally achieves higher accuracy scores on seven benchmark cytometry datasets and six single-cell RNA-sequencing datasets using six different evaluation metrics, compared with other graph-based clustering methods, including PhenoGraph, Scanpy, and PARC. Additionally, CosTaL has the fastest computational time on large datasets, suggesting that CosTaL generally scales better than the other methods, which is beneficial for processing large datasets.
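The graph construction described above can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the CosTaL implementation: it builds a brute-force exact kNN graph with cosine similarity and re-weights each edge with the Tanimoto coefficient computed on the two cells' neighbor sets (an assumption modeled on how PhenoGraph applies the Jaccard index; the abstract does not say what the coefficient is computed on). The `min_weight` threshold and the toy expression values are hypothetical, and the final Leiden community detection step is omitted because it needs an external library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_cosine(cells, k):
    """Exact (brute-force) k nearest neighbors of every cell by cosine similarity."""
    neighbors = []
    for i, u in enumerate(cells):
        sims = [(cosine(u, v), j) for j, v in enumerate(cells) if j != i]
        sims.sort(reverse=True)
        neighbors.append({j for _, j in sims[:k]})
    return neighbors


def tanimoto_sets(a, b):
    """Tanimoto coefficient of two sets: |a & b| / |a | b|."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 0.0


def pruned_graph(cells, k, min_weight=0.0):
    """Return {(i, j): weight} for kNN edges whose re-weighted value exceeds min_weight."""
    nbrs = knn_cosine(cells, k)
    edges = {}
    for i, neighbor_set in enumerate(nbrs):
        for j in neighbor_set:
            # Assumption: compare closed neighborhoods (each cell plus its neighbors).
            weight = tanimoto_sets(nbrs[i] | {i}, nbrs[j] | {j})
            if weight > min_weight:
                edges[(min(i, j), max(i, j))] = weight
    return edges


# Toy data: two groups of three cells with distinct expression patterns.
cells = [[5, 1, 0], [4, 1, 0], [5, 2, 0], [0, 1, 5], [0, 1, 4], [0, 2, 5]]
edges = pruned_graph(cells, k=2)
print(sorted(edges.items()))
```

The resulting weighted edge list is the kind of input a community detection routine would take; cells in the same toy group end up fully connected, with no edges between groups.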

G-002: Meta-analysis of neuroblastoma single cell RNA-seq datasets identifies conserved and divergent gene expression programs across human and preclinical models
COSI: GenCompBio
  • Richard Chapple, St Jude Children's Hospital, United States
  • Sivaraman Natarajan, St Jude Children's Hospital, United States
  • William Wright, St Jude Children's Hospital, United States
  • Min Pan, St Jude Children's Hospital, United States
  • Hm Lee, St Jude Children's Hospital, United States
  • Anand Patel, St Jude Children's Hospital, United States
  • Michael Dyer, St Jude Children's Hospital, United States
  • John Easton, St Jude Children's Hospital, United States
  • Paul Geeleher, St Jude Children's Hospital, United States


Presentation Overview: Show

Neuroblastoma is a highly heterogeneous disease, not only in the clinical presentation of individual patients but also in the cellular composition of any given tumor. Insights into this diversity have only recently been enabled by advancements in single-cell technologies, which have facilitated investigation of this disease at unprecedented resolution and detail. Alongside the growing number of scRNA-seq technologies, the number of single-cell datasets encompassing neuroblastoma patients across several institutions is also growing. However, due to the rarity of the disease and limited sample access, the cohort in each scRNA-seq study covers only part of the spectrum of disease classifications, which limits the ability of any single study to draw conclusions about neuroblastoma as a whole. Moreover, inconsistencies in data acquisition and analytical approaches across these studies have led to diverging interpretations. We therefore assembled all publicly available neuroblastoma scRNA-seq studies, representing a more comprehensive cross-section of patient presentations, with the goal of conducting an exhaustive meta-analysis of the underlying data. To this end, we have implemented a generalizable non-negative matrix factorization (NMF)-based framework targeted at discovering conserved gene expression programs in humans and preclinical models of neuroblastoma.

G-003: Proteogenomics analysis to identify acquired resistance-specific alterations in melanoma PDXs on MAPKi therapy
COSI: GenCompBio
  • Kanishka Manna, University of Arkansas for Medical Sciences (UAMS), United States
  • Prashanthi Dharanipragada, University of California, Los Angeles (UCLA), United States
  • Duah Alkam, University of Arkansas for Medical Sciences (UAMS), United States
  • Nathan Avaritt, University of Arkansas for Medical Sciences (UAMS), United States
  • Charity Washam, University of Arkansas for Medical Sciences (UAMS), United States
  • Michael Robeson, University of Arkansas for Medical Sciences (UAMS), United States
  • Ricky Edmondson, University of Arkansas for Medical Sciences (UAMS), United States
  • Zhentao Yang, University of California, Los Angeles (UCLA), United States
  • Yan Wang, University of California, Los Angeles (UCLA), United States
  • Shirley Lomeli, University of California, Los Angeles (UCLA), United States
  • Gatien Moriceau, University of California, Los Angeles (UCLA), United States
  • Stephanie Byrum, University of Arkansas for Medical Sciences (UAMS), United States
  • Roger Lo, University of California, Los Angeles (UCLA), United States
  • Alan Tackett, University of Arkansas for Medical Sciences (UAMS), United States


Presentation Overview: Show

Therapeutic approaches to treat melanoma include small molecule drugs that target activating protein mutations in pro-growth signaling pathways like the MAPK pathway. While beneficial to the approximately 50% of patients with activating BRAFV600 mutation, mono- and combination therapy with MAPK inhibitors is ultimately associated with acquired resistance. To better characterize the mechanisms of MAPK inhibitor resistance in melanoma, we utilize patient-derived xenografts and apply proteogenomic approaches leveraging genomic, transcriptomic, and proteomic technologies that permit the identification of resistance-specific alterations and therapeutic vulnerabilities. A specific challenge for proteogenomic applications comes at the level of data curation to enable multi-omics data integration. Here, we present a proteogenomic approach that uses custom curated databases to identify unique resistance-specific alterations in melanoma PDX models of acquired MAPK inhibitor resistance. We demonstrate this approach with an NRASQ61L melanoma PDX model from which resistant tumors were developed following treatment with a MEK inhibitor. Our multi-omics strategy addresses current challenges in bioinformatics by developing custom curated proteogenomics databases derived from individual resistant melanomas that evolve following MEK inhibitor treatment, and it is scalable to comprehensively characterize acquired MAPK inhibitor resistance across patient-specific models and genomic subtypes of melanoma.

G-004: Computational Prediction of COVID-19 Risky Genes associated with Lung Cancer
COSI: GenCompBio
  • Judy Bai, Greenhills School, United States
  • Yongsheng Bai, Eastern Michigan University, United States


Presentation Overview: Show

Lung Cancer is the uncontrolled division of abnormal cells in the lungs and can spread to other organs. It is also the third most-common cancer in the USA. Coronavirus Disease 2019 (COVID-19) is a disease causing lung infection, caused by Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2), and has been a global pandemic. It has also been noticed that people with pre-existing medical conditions whose immune systems do not function correctly, or at all, due to cancer treatment (e.g., chemotherapy) are prone to infection with COVID-19 and to developing severe problems. There are also few studies so far on how genes associated with Lung Cancer could serve as targets of COVID-19. In this project, we used bioinformatics approaches to study genes and their molecular mechanisms contributing to Lung Cancer and the COVID-19 disease. Specifically, we first calculated expression for literature-reported candidate genes associated with The Cancer Genome Atlas (TCGA) Lung Adenocarcinoma (LUAD); next, we conducted Protein-Protein Interaction (PPI) network analysis for 30 candidate down-regulated genes; then we performed functional annotation, pathway, and survival analysis. Afterwards, we identified conserved domains on them. Finally, we cross-checked the SARS-CoV-2 infection literature and pinpointed 4 surfactant genes that could serve as potential biomarkers.

G-005: Assessing Experimental Model Similarity to Patient Samples through Feature-Weighted Molecular Profiles using TumorComparer
COSI: GenCompBio
  • Rileen Sinha, Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA, United States
  • Augustin Luna, Department of Cell Biology, Harvard Medical School, Boston, MA, USA, United States
  • Nikolaus Schultz, Computational Oncology, Memorial Sloan Kettering Cancer Center, New York, NY, USA, United States
  • Chris Sander, Department of Cell Biology, Harvard Medical School, Boston, MA, USA, United States


Presentation Overview: Show

Cancer is a genetic disease, typically marked by widespread somatic alterations (e.g., mutations, copy-number alterations, and gene expression changes). However, not all changes are functionally important; a few genes can promote oncogenesis (termed “cancer drivers”), whereas other altered genes have little effect on the phenotype (termed “passengers”). Furthermore, many research questions focus on particular genes and their activity (e.g., specific signaling pathways, drug targets, etc.). This motivates the need for a flexible method of comparing tumors with potential cell line models by using researcher-selected properties. We present TumorComparer, a computational comparison method based on weighted features to allow expert- and knowledge-driven comparison of tumors and experimental models, such as cell lines or organoids. We apply TumorComparer to the comparison of ∼8,000 tumors and ∼600 cell lines across 24 cancer types as an initial application to provide a general, pan-cancer resource based on knowledge of oncogenic alterations gained from The Cancer Genome Atlas program (TCGA). TumorComparer is provided as an open-source R package (github.com/sanderlab/tumorcomparer) and interactive web application (tumorcomparer.org) for customized analyses. TumorComparer is a generally applicable method suitable for pre-clinical cancer research and personalized medicine applications where sets of samples need to be assessed for similarity.
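The idea of feature-weighted comparison can be illustrated with a small sketch. It is not the TumorComparer algorithm or its R API: the abstract does not specify the similarity measure, so a weighted Pearson correlation is used here as a stand-in, and the profiles and weights are hypothetical toy values. The first three features play the role of driver genes and the last two the role of passengers.

```python
import math


def weighted_correlation(x, y, w):
    """Weighted Pearson correlation between profiles x and y with feature weights w."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    var_x = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    var_y = sum(wi * (yi - mean_y) ** 2 for wi, yi in zip(w, y))
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Toy molecular profiles (hypothetical values for illustration only).
tumor = [2, 0, 1, 5, -5]
line_a = [2, 0, 1, -5, 5]   # matches the tumor on drivers, differs on passengers
line_b = [0, 2, 1, 5, -5]   # differs on drivers, matches on passengers

uniform = [1, 1, 1, 1, 1]
driver_weighted = [1, 1, 1, 0.01, 0.01]  # down-weight passenger features

for name, line in (("line_a", line_a), ("line_b", line_b)):
    print(name,
          round(weighted_correlation(tumor, line, uniform), 3),
          round(weighted_correlation(tumor, line, driver_weighted), 3))
```

With uniform weights the passenger features dominate and `line_b` looks closer to the tumor; once the drivers are emphasized, `line_a` becomes the better match, which is the behavior a knowledge-driven weighting is meant to produce.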

G-006: EpiMix: an integrative tool for resolving epigenetic heterogeneity using DNA methylation
COSI: GenCompBio
  • Yuanning Zheng, Stanford University, United States
  • John Jun, Stanford University, United States
  • Kevin Brennan, Stanford University, United States
  • Olivier Gevaert, Stanford University, United States


Presentation Overview: Show

Emerging evidence has revealed the regulatory roles of DNA methylation (DNAme) on protein-coding genes and non-coding RNAs, and recent technologies enable genome-wide quantification of DNAme in large human cohorts. This creates the need for a model-based computational approach to resolve the epigenetic heterogeneity in large human cohorts and to pinpoint the individuals carrying differential methylation profiles. Here we developed EpiMix, a comprehensive tool for population-level analysis of DNAme. EpiMix allows us to detect abnormal DNAme that is present in only small subsets of a patient cohort and to identify DNAme-associated disease subtypes. Furthermore, we applied this model-based approach to identify abnormal DNAme at functionally diverse genomic elements, including cis-regulatory elements within protein-coding genes, distal enhancers, and genes encoding microRNAs and lncRNAs. In two separate studies, we showed that EpiMix discovered novel epigenetic mechanisms underlying childhood food allergy and survival-associated, methylation-driven non-coding RNAs in non-small cell lung cancer. EpiMix is available as an R package and a web-based tool: https://epimix.stanford.edu

G-007: Detecting differential initiations of transcripts at single cell type level
COSI: GenCompBio
  • Shuhua Fu, Washington University in St. Louis, United States
  • Parker Wilson, Washington University in St. Louis, United States
  • Benjamin Humphreys, Washington University in St. Louis, United States
  • Bo Zhang, Washington University in St. Louis, United States


Presentation Overview: Show

Motivation
Genes can produce different transcript isoforms by using distinct Transcriptional Start Regions (TSRs), which are recognized by RNA-Polymerase II and regulated by cell-type-specific expressed transcription factors, and eventually contribute to the formation of tissue and cell-type specificity. However, the alternative usages of gene TSRs among tissues and cell types are still largely unknown.
Results
To systematically explore the alternative usage of gene TSRs, we developed TSRdetector. This novel bioinformatic method is specifically designed to detect the significantly altered use of gene TSRs among different tissues, cell types, or diseases. TSRdetector can process either single-cell RNA-seq or bulk RNA-seq transcriptome data, automatically calculate and define tissue-dominant TSRs and further compute the differential usage of gene TSRs in given conditions. We applied TSRdetector to analyze the single-nucleus RNA-seq dataset of healthy and diabetic human kidneys. Between the two major cell types (DCTPC and PT) of the human kidney, 268 genes were found to have significant differential usage of dominant TSRs, and 31% derived the significant differential gene expression between the two cell types. In three major cell types (DCTPC, PT and LOH) of human diabetic samples, we discovered 237 genes that significantly altered the usage of dominant TSRs, including nine critical diabetes-related genes, such as ATP5PD, MTR, and HMGCR. We further analyzed the coding capacity of transcripts that altered dominant TSRs, and found that most of these genes could be translated to different downstream protein products while maintaining a stable mRNA expression level. We also applied TSRdetector to mouse B-cell Smart-seq2 data across different developmental time points and identified 100 genes with significantly altered TSR usage, including 11 epigenetic factors.

G-008: FASTAptameR 2.0: A Web Server for Combinatorial Sequence Selections
COSI: GenCompBio
  • Skyler Kramer, University of Missouri - Columbia, Division of Biological Sciences, United States
  • Paige Gruenke, Department of Biochemistry, University of Missouri - Columbia, United States
  • Khalid Alam, Stemloop Inc., United States
  • Dong Xu, Univ. of Missouri-Columbia, United States
  • Donald Burke, Department of Molecular Immunology and Microbiology and Department of Biochemistry, University of Missouri - Columbia, United States


Presentation Overview:

Combinatorial selections are powerful strategies for identifying biopolymers with specific biological, biomedical, or chemical characteristics and understanding fitness and selection dynamics. These experiments can generate large volumes of data, thus driving the need for high-throughput sequencing (HTS) analysis tools. Although the selections field has recently benefited from several software tools for HTS analysis, these tools have a high entry barrier for many users because they require command-line access or extensive programming expertise. FASTAptameR 2.0 is an R-based reimplementation of FASTAptamer designed to minimize this entry barrier while maintaining the ability to answer complex sequence- and population-level questions. This open-source toolkit features a user-friendly web server, interactive graphics, expanded module set, up to 100x faster clustering than FASTAptamer, and an extensive user guide. FASTAptameR 2.0 accepts diverse inputs (such as aptamers, ribozymes, xenonucleic acids, or peptides) and can be applied to any sequence-encoded selection. FASTAptameR 2.0 is available as a web tool (https://fastaptamer2.missouri.edu/), Docker image (skylerkramer/fastaptamer2), or GitHub repository (SkylerKramer/FASTAptameR-2.0).

G-009: Machine learning development environment for single-cell sequencing data analyses
COSI: GenCompBio
  • Lei Jiang, University of Missouri, Columbia, United States
  • Yuexu Jiang, University of Missouri, Columbia, United States
  • Cankun Wang, The Ohio State University, United States
  • Clement Essien, University of Missouri, Columbia, United States
  • Juexin Wang, University of Missouri, Columbia, United States
  • Anjun Ma, The Ohio State University, United States
  • Qin Ma, The Ohio State University, United States
  • Dong Xu, University of Missouri, Columbia, United States


Presentation Overview:

Machine learning (ML) is transforming the analysis of single-cell sequencing data; however, technological complexity and the biological knowledge required remain barriers to the involvement of the ML community. We present an ML development environment for single-cell sequencing data analyses with a diverse set of realistic and accessible ML-ready benchmark datasets. A cloud-based platform is built to dynamically scale pipelines for collecting, processing, and managing various single-cell sequencing data to make them ML-ready. Benchmarks, assessment utilities for evaluating results and generating reports, and an integrated development environment (IDE) with code-level and web interfaces are also developed to support partial method development. Large-scale ML-ready datasets and benchmarks are applied to multiple single-cell analysis tasks, including clustering, marker gene identification, trajectory inference, cell-cell communication, and multi-omics data integration. Each benchmark is divided into training, validation, and test sets in multiple settings, including a minimum viable benchmark to assist efficient method development and a comprehensive benchmark for full evaluations. Automated end-to-end ML pipelines for single-cell analyses are developed to simplify and standardize single-cell data formatting, loading, model development, and model evaluation. The platform can significantly lower the method development barrier in single-cell data analyses for ML researchers.

G-010: Integrating High Throughput Transcriptomics into a Tiered Framework to Prioritize Chemicals for Toxicity Testing
COSI: GenCompBio
  • Jesse Rogers, U.S. Environmental Protection Agency, United States
  • Katie Paul-Friedman, U.S. Environmental Protection Agency, United States
  • Logan Everett, U.S. Environmental Protection Agency, United States


Presentation Overview:

US EPA is developing a tiered assessment strategy for chemical toxicity testing by integrating multiple data streams. Pairing high-content assays such as high-throughput transcriptomics (HTTr) with high-throughput screening (HTS) of specific molecular targets may improve confidence when assessing key hazards. Here, we used HTTr screening data to generate new signatures representing known molecular targets, and signature-level potency estimates were integrated with complementary HTS assays as a proof-of-concept framework for chemical prioritization. Transcriptomic profiles generated via the TempO-Seq platform in two distinct cell lines were used to develop signatures composed of genes selectively responsive to reference chemicals for one of 13 distinct molecular targets. Of 1,218 chemicals screened in HTTr to date, 232 chemicals demonstrated selective potency in at least one reference signature. In examining these chemicals using available orthogonal HTS assays from US EPA’s ToxCast program, 74 chemicals were confirmed as potential selective AHR, GR, or RAR/RXR nuclear receptor agonists. Our work demonstrates that HTTr data can inform putative molecular targets and identify chemicals for further screening in a framework to support chemical risk assessment. The views expressed in this abstract are those of the authors and do not necessarily reflect the views or policies of the US EPA.

G-011: Amino acid sequence assignment from single molecule peptide sequencing data using a two-stage classifier
COSI: GenCompBio
  • Matthew Smith, Oden Institute, The University of Texas at Austin, United States
  • Edward Marcotte, Department of Molecular Biosciences, The University of Texas at Austin, United States (COI: Erisyon co-founder, shareholder, SAB member)


Presentation Overview:

Tools for protein identification and quantification lag behind DNA and RNA sequencing techniques in sensitivity and throughput. To address this, our group invented fluorosequencing, a single-molecule protein sequencing technology. In fluorosequencing, proteins are proteolytically digested into peptides, and specific amino acids are labeled with fluorescent dyes. Labeled peptides are immobilized in a flow cell where, using Edman degradation chemistry, they are sequenced in parallel while being imaged by single-molecule microscopy. Fluorosequencing produces sequencing reads from many individual molecules simultaneously, but with substantially elevated noise and error rates that must be addressed in subsequent computational analysis.

We found that Hidden Markov Models representing the state changes of a peptide undergoing sequencing can provide excellent measures of the probability of a fluorosequencing read given that peptide, which we can in turn use for Bayesian classification. Naïve models did not scale to larger peptides with more labels, so we developed a number of novel algorithmic adjustments to our Hidden Markov Model implementation tailored to fluorosequencing data. Additionally, we combined our brute-force Bayesian classifier with a k-Nearest-Neighbors classifier that reduces the number of Hidden Markov Models that need to be built and run.
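
The Bayesian classification step can be sketched as follows: given the likelihood of a read under each candidate peptide (which the abstract obtains from Hidden Markov Models), the posterior follows from Bayes' rule. The likelihood values, peptide names, uniform prior, and function name below are invented for illustration; the Hidden Markov Models and the k-Nearest-Neighbors pre-filter are not reproduced.

```python
from typing import Dict, Optional, Tuple


def classify_read(
    likelihoods: Dict[str, float],
    priors: Optional[Dict[str, float]] = None,
) -> Tuple[str, float]:
    """Return the most probable peptide and its posterior probability.

    likelihoods[p] stands in for P(read | peptide p).
    priors[p] is P(peptide p); a uniform prior is assumed if omitted.
    """
    if priors is None:
        priors = {p: 1.0 / len(likelihoods) for p in likelihoods}
    # Joint probability P(read, peptide) for each candidate.
    joint = {p: likelihoods[p] * priors[p] for p in likelihoods}
    evidence = sum(joint.values())  # P(read), the normalizing constant
    if evidence == 0.0:
        raise ValueError("read has zero likelihood under every candidate")
    best = max(joint, key=joint.get)
    return best, joint[best] / evidence


# Hypothetical likelihoods for one read under three candidate peptides.
peptide, posterior = classify_read({"pepA": 0.02, "pepB": 0.06, "pepC": 0.02})
```

A pre-filter such as the k-Nearest-Neighbors classifier mentioned above would shrink the candidate dictionary so that fewer likelihoods have to be computed per read.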

G-012: Rat Reference Genome mRatBN7.2 Curation
COSI: GenCompBio
  • Wendy Demos, Rat Genome Database, Department of Biomedical Engineering, Medical College of Wisconsin, United States
  • Valerie A. Schneider, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, United States
  • Terence D. Murphy, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, United States
  • Monika Tutaj, Rat Genome Database, Department of Biomedical Engineering, Medical College of Wisconsin, United States
  • Jennifer R. Smith, Rat Genome Database, Department of Biomedical Engineering, Medical College of Wisconsin, United States
  • Anne E. Kwitek, Rat Genome Database, Department of Biomedical Engineering, Department of Physiology, Medical College of Wisconsin, United States


Presentation Overview:

Rattus norvegicus (rat) is an important experimental model for human diseases. Previous rat genome references were highly fragmented despite periodic updates. The latest assembly, mRatBN7.2, addresses many deficiencies of prior assemblies but requires continued manual curation for reliability and optimization.
Reference genome issues are reported directly to the Genome Reference Consortium (GRC; https://www.ncbi.nlm.nih.gov/grc/report-an-issue) by RefSeq curators and the rat research community. Issues are assigned a ticket ID via the Atlassian JIRA Service Management platform and addressed by GRC curators at the Rat Genome Database (RGD). The ticket prioritization strategy is modeled after processes established by the GRC: effects on protein function come first (known effects, then potential effects), followed by sequence differences outside the coding region, with priority given to community-reported issues. Ticket resolution relies heavily on public tools such as Genome Workbench (NCBI), JBrowse Genome Browser (genomics data produced and curated by RGD at the Medical College of Wisconsin), and additional curation tools available to curators on the GRC platform.
A workflow has been established to review and resolve tickets. Ticket resolution status updates are provided on the GRC webpage and are being integrated into RGD through gene pages and JBrowse Genome Browser and announced to the community through RGD social media.
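
The prioritization order described above can be expressed as a sort key. The ticket fields, IDs, and category labels in this sketch are hypothetical, and treating the community-reported flag as a tie-breaker within each tier is one reading of the text, not a documented GRC rule.

```python
from typing import Dict, List

# Lower rank means higher priority; the category labels are illustrative.
IMPACT_RANK = {
    "known_protein_effect": 0,
    "potential_protein_effect": 1,
    "noncoding_difference": 2,
}


def prioritize(tickets: List[Dict]) -> List[Dict]:
    """Sort by impact tier, placing community-reported tickets first in a tier."""
    return sorted(
        tickets,
        key=lambda t: (IMPACT_RANK[t["impact"]], not t["community_reported"]),
    )


# Hypothetical tickets.
tickets = [
    {"id": "T-3", "impact": "noncoding_difference", "community_reported": True},
    {"id": "T-1", "impact": "potential_protein_effect", "community_reported": False},
    {"id": "T-2", "impact": "known_protein_effect", "community_reported": False},
    {"id": "T-4", "impact": "potential_protein_effect", "community_reported": True},
]
ordered_ids = [t["id"] for t in prioritize(tickets)]
```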

G-013: Predicting Epitopes for SARS-CoV-2
COSI: GenCompBio
  • Akshay Agarwal, IBM, United States
  • Kristen Beck, IBM, United States
  • Sara Capponi, IBM, United States
  • Mark Kunitomi, IBM, United States
  • Gowri Nayar, IBM, United States
  • Edward Seabolt, IBM, United States
  • Gandhar Mahadeshwar, IBM, United States
  • Simone Bianco, IBM, United States
  • Vandana Mukherjee, IBM, United States
  • James Kaufman, IBM, United States


Presentation Overview:

Epitopes are short amino acid sequences that define the antigen signature to which an antibody binds. In light of the current pandemic, epitope analysis and prediction are paramount to improving serological testing and developing vaccines. In this paper, we leverage known epitope sequences from SARS-CoV, SARS-CoV-2, and other Coronaviridae to identify additional antigen regions in 62K SARS-CoV-2 genomes. Additionally, we present the epitope distribution across SARS-CoV-2 genomes, locate the most commonly found epitopes, and discuss where epitopes are located on proteins and how they can be grouped into classes. We also discuss the mutation density of different regions on proteins using a big data approach. We find that there are many conserved epitopes between SARS-CoV-2 and SARS-CoV, with more diverse sequences found in Nucleoprotein and Spike Glycoprotein.

G-014: Universal and tissue of origin specific DNA methylation markers in aerodigestive cancers
COSI: GenCompBio
  • Zhifu Sun, Mayo Clinic, United States
  • William Taylor, Mayo Clinic, United States
  • Saurabh Behati, Mayo Clinic, United States
  • Seth Slettedahl, Mayo Clinic, United States
  • Douglas Mahoney, Mayo Clinic, United States
  • John Kisiel, Mayo Clinic, United States


Presentation Overview:

Aerodigestive cancers account for a third of new cancer cases and about half of total cancer deaths in the US, which makes early diagnosis critical, for example through minimally invasive liquid biopsy of cell-free DNA methylation. However, identifying a reliable cancer-specific signal against the high background of cell-free DNA is very challenging. Using large datasets, we aimed to identify CpG sites or regions that work best as universal and tissue-of-origin-specific markers. DNA methylation data from 2585 samples were randomly split into a training and a testing dataset. The universal cancer biomarkers were identified by comparing all tumors to normal samples, and cancer-site-specific markers were selected by “one-vs-all others” comparisons. The selected markers were used to build machine learning prediction models. We identified 223 universal cancer markers, which achieved perfect sensitivity and 0.83 specificity in the test dataset and 0.90 sensitivity and 0.71 specificity in the internal sequencing data. The 341 cancer-specific markers had an overall accuracy of 0.97 in the test dataset. We have identified DNA methylation markers that are highly discriminative between cancer and normal tissues and between different aerodigestive cancers and that can potentially be used for plasma-based cancer detection.
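
For reference, sensitivity and specificity of the kind reported above are computed from a confusion matrix as sketched below. The labels and predictions are toy values chosen for illustration, not the study's data, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = cancer)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative sample")
    return tp / (tp + fn), tn / (tn + fp)


# Toy data: 4 cancer samples followed by 6 normal samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
```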

G-015: Clonally selected lines after CRISPR/Cas editing are not isogenic
COSI: GenCompBio
  • Arijit Panda, Mayo Clinic, United States
  • Milovan Suvakov, Mayo Clinic, United States
  • Jessica Mariani, Yale University, United States
  • Kristen L. Drucker, Mayo Clinic, United States
  • Yohan Park, Oklahoma Medical Research Foundation, United States
  • Yeongjun Jang, Mayo Clinic, United States
  • Thomas M. Kollmeyer, Mayo Clinic, United States
  • Gobinda Sarkar, Mayo Clinic, United States
  • Taejeong Bae, Mayo Clinic, United States
  • Jean J. Kim, Baylor College of Medicine, United States
  • Wan Hee Yoon, Oklahoma Medical Research Foundation, United States
  • Robert B. Jenkins, Mayo Clinic, United States
  • Flora Vaccarino, Yale University, United States
  • Alexej Abyzov, Mayo Clinic, United States


Presentation Overview:

The CRISPR-Cas9 system has enabled researchers to precisely edit the sequence of a genome. A typical editing experiment consists of two steps: (i) editing cultured cells; (ii) cell cloning and selection of clones with and without the intended edit, presumed to be isogenic. Application of the CRISPR-Cas9 system may result in off-target edits, while cloning will reveal culture-acquired mutations. We analyzed the extent of the former and the latter by whole-genome sequencing in three experiments involving separate genomic loci and conducted by three independent laboratories. In all experiments we found hardly any off-target edits, while detecting hundreds to thousands of single-nucleotide mutations unique to each clone after a relatively short culture of 10-20 passages. Notably, clones also differed in copy number alterations that were several kb to several Mb in size and represented the largest source of genomic divergence among clones. We suggest that screening of clones for mutations and copy number alterations acquired in culture is a necessary step to allow correct interpretation of DNA editing experiments. Furthermore, since culture-associated mutations are inevitable, we propose that experiments involving derivation of clonal lines should compare a mix of multiple unedited lines and a mix of multiple edited lines.

G-016: Neural relational inference to learn long-range allosteric interactions in proteins from molecular dynamics simulations
COSI: GenCompBio
  • Juexin Wang, University of Missouri, United States
  • Jingxuan Zhu, Jilin University, China
  • Weiwei Han, Jilin University, China
  • Dong Xu, Univ. of Missouri-Columbia, United States


Presentation Overview:

Protein allostery is a biological process facilitated by spatially long-range intra-protein communication, whereby ligand binding or amino acid change at a distant site affects the active site remotely. Molecular dynamics (MD) simulation provides a powerful computational approach to probe the allosteric effect. However, current MD simulations cannot reach the time scales of whole allosteric processes. The advent of deep learning made it possible to evaluate both spatially short and long-range communications for understanding allostery. For this purpose, we developed and applied a neural relational inference model based on a graph neural network, which adopts an encoder-decoder architecture to simultaneously infer latent interactions for probing protein allosteric processes as dynamic networks of interacting residues. From the MD trajectories, this model successfully learned the long-range interactions and pathways that can mediate the allosteric communications between distant sites in the Pin1, SOD1, and MEK1 systems. Furthermore, the model can discover allostery-related interactions earlier in the MD simulation trajectories and predict relative free energy changes upon mutations more accurately than other methods. The software is open source and available at https://github.com/juexinwang/NRI-MD.

G-017: Widespread redundancy in -omics profiles of cancer mutation states
COSI: GenCompBio
  • Jake Crawford, University of Pennsylvania, United States
  • Maria Chikina, University of Pittsburgh, United States
  • Brock Christensen, Geisel School of Medicine at Dartmouth College, United States
  • Casey Greene, University of Colorado School of Medicine, United States


Presentation Overview:

Although DNA sequencing identifies cancer mutations, other -omics assays can provide a fuller picture of the cellular dysregulation underlying cancer pathology. However, for a given mutation, it is not always clear which -omics layer will best capture cancer-relevant signal. To evaluate the information content of different -omics types, we use them as input to classifiers trained to distinguish between samples with and without mutations in key cancer genes. Using data from the TCGA Pan-Cancer Atlas, we focus on RNA sequencing, DNA methylation arrays, reverse phase protein arrays, microRNA sequencing, and somatic mutational signatures as readouts of mutational state.

Across a collection of 217 cancer-related genes, RNA sequencing tends to be the most effective predictor of mutational state. Surprisingly, we found that other -omics layers are equally effective predictors for many genes. Mutations in most genes predicted accurately by at least one readout (52/86, or 60.5%) were predicted accurately by two or more independent readouts from the six we considered. We also found that multi-omics models provided little or no predictive improvement over the best single-omics model for six well-studied cancer genes. Our results will inform the future design of studies focused on the functional outcomes of cancer mutations.

G-018: Optimizing Model Selection for Glioblastoma Utilizing Gene Expression
COSI: GenCompBio
  • Avery Williams, University of Alabama at Birmingham, United States
  • Elizabeth Ramsey, University of Alabama at Birmingham, United States
  • Jennifer Fisher, University of Alabama at Birmingham, United States
  • Vishal H. Oza, University of Alabama at Birmingham, United States
  • Brittany Lasseigne, University of Alabama at Birmingham, United States


Presentation Overview:

Glioblastoma (GBM) is a debilitating brain cancer that affects around 210,000 people worldwide. Currently, disease diagnosis and monitoring are typically done via tissue biopsy, but this method is invasive, difficult in cases of tumor inaccessibility, and only provides a single snapshot that may not be representative of disease heterogeneity or etiology. Further, there has been difficulty determining viable treatment options for the disease, which has a high relapse and morbidity rate. One possibility for improving patient treatment is through preclinical models like cell lines and patient-derived xenografts (PDXs). In this study, we used public cohort gene expression data from The Cancer Genome Atlas (TCGA; GBM patient tissue), the Cancer Cell Line Encyclopedia (CCLE; cell lines), and the Mayo Clinic Brain Tumor Patient-Derived Xenograft National Resource (PDX models) to identify global patterns and differences that may suggest the advantages and weaknesses of given preclinical models as avatars for specific patients.

Through hierarchical clustering, ranked correlation, and other analyses we demonstrate strategies for identifying optimal models and their limitations. Our long-term goal is to identify the best model for a patient and to develop computational approaches for assessment and further analyses.

G-019: ATHENA: Analysis of Tumor Heterogeneity from Spatial Omics Measurements
COSI: GenCompBio
  • Adriano Martinelli, ETH Zurich, Switzerland
  • Pushpak Pati, IBM Research Zurich, Switzerland
  • Maria Anna Rapsomaniki, IBM Research Zurich, Switzerland


Presentation Overview:

Tumor heterogeneity has emerged as a fundamental property of most human cancers, and its accurate and biologically meaningful quantification has the potential to translate biological complexity into clinically actionable insight. Currently, spatial omics technologies are revolutionizing our understanding of tumor ecosystems, enabling their deep phenotypic profiling at an unprecedented resolution while preserving the tumor topology. Although several spatial omics data analysis tools have started to emerge, a dedicated resource that enables tumor heterogeneity quantification is largely missing. We introduce here ATHENA, a computational framework that brings together a large collection of established and novel heterogeneity scores borrowing ideas from spatial statistics, graph theory and information theory, able to capture the heterogeneity of the tumor ecosystem. ATHENA supports any spatial omic dataset, as well as standard tissue imaging data. Using a publicly available imaging mass cytometry dataset, we show how ATHENA can highlight tumor regions of high spatial heterogeneity and quantify spatial properties, cell interaction and immune infiltration patterns present in the tumor ecosystem. ATHENA is implemented in a highly modular, extendable, and scalable fashion, with emphasis on visualization and interoperability with other popular computational frameworks, and it is available as a Python package under an open-source license here: https://github.com/AI4SCR/ATHENA.

G-020: Binding and sliding of the CRISPR/Cas9-gRNA complex
COSI: GenCompBio
  • Giulia I. Corsi, University of Copenhagen, Denmark
  • Kunli Qu, BGI-Qingdao; University of Copenhagen, China
  • Ferhat Alkan, The Netherlands Cancer Institute, Netherlands
  • Xiaoguang Pan, BGI-Qingdao, China
  • Yonglun Luo, BGI-Qingdao; BGI-Shenzhen; Aarhus University, Denmark
  • Jan Gorodkin, University of Copenhagen, Denmark


Presentation Overview:

CRISPR/Cas9 cleavage efficiency largely depends on the properties of the guide RNA (gRNA) employed for target recognition. Using an energy-based model of Cas9-gRNA-target binding, we identify a sweet-spot range of gRNA-DNA hybridization free energy in which gRNAs are most efficient, whereas less efficient gRNAs bind either too weakly or too strongly. The affinity to this sweet-spot range explains, for the first time, why some gRNAs can cleave off-targets more efficiently than their intended on-target. Furthermore, exploring the context of the Cas9 binding site, we report that cleavage efficiency can become stronger or weaker depending on the presence of overlapping binding sites, at which Cas9 can “slide” and the gRNA can bind, forming bulged matches with the target. We verify the cleavage efficiency behavior related to this sliding activity on both published and in-house generated data. The same observations hold at non-canonical binding sites and for Cas9 variants with increased fidelity or broadened compatibility to binding motifs. The possibility of sliding on adjacent sites is integrated in our previously established energy-based gRNA specificity calculation, allowing us to better isolate highly specific and efficient gRNAs, which are the preferable choice for practical applications.

G-021: A machine learning approach to detect somatic variants in tumor RNA-Seq.
COSI: GenCompBio
  • Audrey Bollas, Nationwide Children's Hospital, United States
  • Peter White, Nationwide Children's Hospital, United States
  • Elaine Mardis, Nationwide Children's Hospital, United States


Presentation Overview:

Reliably identifying genomic variants from next-generation sequencing data is a critical step in studying the relationship between genotype and cancer susceptibility and tumorigenesis. Single nucleotide variants (SNVs) are largely obtained through DNA sequencing (DNA-Seq); however, recent efforts have been made to use RNA sequencing (RNA-Seq) reads to decipher the impact of variation in the transcriptome. This approach offers the advantage of being less cost-intensive than whole-genome or whole-exome DNA-Seq, and RNA-Seq data are already generated and applied in numerous analytical pipelines, such as gene expression, RNA editing, splicing, and allele-specific expression, even without matched DNA. Additionally, there is the potential to discover new variants in highly expressed genes, some of which may be important oncogenes.

The existing tools for RNA-Seq SNV discovery suffer from high false positive rates and rely on both DNA- and RNA-Seq to differentiate between germline and somatic variants. Because normal tissue from the area of interest is rarely acquired for normal transcriptome comparison, tumor-only RNA-Seq is predominantly performed. We have developed a bioinformatics approach to identify variants from tumor-only RNA-Seq data, and a machine learning model to accurately classify variants as false positive, germline, or somatic.

G-022: Bayesian network modeling reveals differences in signaling dynamics between self-renewing and proliferating leukemia stem cells
COSI: GenCompBio
  • Daniel Chang, University of Minnesota - Twin Cities, United States
  • Zohar Sachs, University of Minnesota - Twin Cities, United States
  • Chad Myers, University of Minnesota - Twin Cities, United States
  • Karen Sachs, External Consultant, United States
  • Marie Lue Antony, University of Minnesota - Twin Cities, United States
  • Klara Noble, University of Minnesota - Twin Cities, United States


Presentation Overview:

Acute myeloid leukemia (AML) is a lethal malignancy. Most patients can achieve complete remission with chemotherapy; however, many patients relapse with refractory disease. Relapse is caused by leukemia stem cells (LSCs), which are endowed with self-renewal capacity. Our goal is to understand the self-renewal mechanisms of LSCs. Our lab showed in a murine model of AML that oncogenic NRASG12V drives either self-renewal or proliferation in LSCs. We identified CD36 and CD69 as cell surface markers that delineate self-renewing and proliferating LSCs. CD36High LSCs were highly proliferative, and CD69High LSCs have higher self-renewal potential.
In this project, we sought to define the signaling states associated with CD69 expression versus those associated with CD36 expression in primary human AML cells. We used mass cytometry (CyTOF) to compare the levels of signaling molecules between CD36High and CD69High LSCs in six primary human samples that harbor either NRAS mutations or MLL rearrangements. Our CyTOF analyses revealed that signaling intermediates are differentially expressed between these groups.
We performed Bayesian network modeling on our CyTOF data to define the global signaling network of these subpopulations. Our analysis suggests that NRASG12V drives self-renewal in CD69High LSCs through altered signaling dynamics to increase levels of NF-kappaB and p4EBP1.

G-023: Evidence for genetic interactions as shared genetic risk factors for Parkinson’s and Alzheimer’s disease
COSI: GenCompBio
  • Wen Wang, University of Minnesota, United States
  • Chad Myers, University of Minnesota, United States


Presentation Overview:

Although Parkinson's disease (PD) and Alzheimer's disease (ALZ) share some similarity in their symptoms and risk factors, no clear genetic link has been found between them through traditional genome-wide association studies (GWAS). Given the complexity of both diseases, combinations of genetic variants such as genetic interactions may help to explain their underlying genetic bases. We previously developed a pathway-based approach called BridGE for detecting genetic interactions from GWAS. In this study, we applied BridGE to six PD and four ALZ cohorts and identified 67 and 45 replicable pathway-level interactions in PD and ALZ, respectively. We found 9 pathways with enriched genetic interactions in both PD and ALZ. The most striking pathway discovered is ADORA2B-mediated anti-inflammatory cytokine signaling (FDR<0.1 in both PD and ALZ), with the driver genes contributing to this interaction including ADCYAP1, GLP2R, GNAS, and PRKACG. We also demonstrate that a genetic-interaction-based polygenic risk score derived from this pathway can differentiate case and control groups in both PD and ALZ with a PRAUC of 0.81 and 0.82, respectively. Our study suggests evidence for common genetic interactions underlying disease risk for both PD and ALZ, and we report candidate pathways, genes, and specific variants worthy of further study.

G-024: ARROW - Allele-specific Recombined sgRNA design for Reduced Off-target With computational profiling
COSI: GenCompBio
  • Dongwon Choo, Pusan National University, South Korea
  • Seunghun Kang, HanYang University, South Korea
  • Woochang Hwang, HanYang University, South Korea
  • Junho Hur, HanYang University, South Korea
  • Giltae Song, Pusan National University, South Korea


Presentation Overview:

Allele-specific genome editing with the CRISPR/Cas9 system is crucial for realizing precise therapeutic treatment of inherited dominant diseases. However, the sequence tolerance of Cas9/gRNA complexes makes it difficult to distinguish mutated alleles from normal alleles that differ by only a single nucleotide. Here, we established a strategy to edit only mutated alleles by reducing sequence tolerance for wild-type (WT) alleles through intentional mispairing introduced into the gRNA, which can increase specificity. We developed a web tool that recommends gRNAs with intentional mispairing. To select among a large number of mispaired gRNAs, we organize the mismatch sites of each gRNA on the genome by number of mismatches (the mismatch state), which serves as a basis for comparing gRNAs. The Manhattan distance is used to measure the similarity between the mismatch state of each mispaired gRNA and the Hypothetically Optimal Mismatch State (HOMS) for the input gRNA, and the mispaired gRNAs with the highest similarity are recommended. ARROW generates 1,770 mispaired gRNAs (with 1 or 2 mispairings) for each input gRNA and offers the mispaired gRNAs whose mismatch state is most similar to HOMS. ARROW proposed mispaired gRNAs with fewer mismatch sites than the input gRNA, so they can be expected to target more precisely than the input gRNA.
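
A minimal sketch of the ranking step, assuming a mismatch state is a vector whose k-th entry counts the genomic sites that match the gRNA with k mismatches. The vectors, gRNA names, HOMS values, and function names are invented for illustration; how ARROW actually defines HOMS is not specified in this abstract.

```python
from typing import Dict, List, Sequence


def manhattan(a: Sequence[int], b: Sequence[int]) -> int:
    """Manhattan (L1) distance between two equal-length mismatch states."""
    if len(a) != len(b):
        raise ValueError("mismatch states must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_by_homs(
    states: Dict[str, Sequence[int]], homs: Sequence[int]
) -> List[str]:
    """Order candidate gRNAs from most to least similar to HOMS."""
    return sorted(states, key=lambda name: manhattan(states[name], homs))


# Hypothetical HOMS: one perfect-match site and no near-matches elsewhere.
homs = [1, 0, 0, 0]

# Hypothetical mismatch states: counts of sites with 0, 1, 2, 3 mismatches.
candidates = {
    "gRNA_a": [1, 0, 2, 5],
    "gRNA_b": [1, 0, 0, 1],
    "gRNA_c": [1, 1, 3, 9],
}
ranking = rank_by_homs(candidates, homs)
```

A smaller distance means a mismatch state closer to the hypothetical optimum, so the first entries of the ranking are the candidates that would be recommended under this reading.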

G-025: A general framework for the combined morphometric, transcriptomic, and physiological analysis of cells using metric geometry
COSI: GenCompBio
  • Kiya Govek, University of Pennsylvania, United States
  • Jake Crawford, University of Pennsylvania, United States
  • Artur Saturnino, University of Pennsylvania, United States
  • Kristi Zoga, University of Pennsylvania, United States
  • Michael Hart, University of Pennsylvania, United States
  • Pablo Camara, University of Pennsylvania, United States


Presentation Overview:

Cell morphology is an essential phenotype in the characterization of cells due to its relation to cell function and its involvement in cellular processes like differentiation and migration. Image-based single-cell transcriptomics can be used to study the molecular mechanisms underlying morphological processes. However, few current methods for cell morphometry can establish associations between morphological and molecular data, and most are limited to predefined shape characteristics or assumptions which might not apply to all cells in a tissue. CAJAL uses the Gromov-Wasserstein distance to build cell morphology summary spaces that describe arbitrary deformations in cell shape and establish associations with single-cell omics and physiological data. We show that CAJAL outperforms current methods at identifying morphological differences between transcriptomically-defined types of cortical neurons profiled with Patch-seq. CAJAL also enables the integration of different imaging technologies and modalities to better utilize the diverse knowledge of available datasets. Using CAJAL, we leveraged published Patch-seq data to refine cell type annotations in a cubic millimeter of the mouse visual cortex profiled with 2-photon and electron microscopy by the MICrONS consortium. We expect CAJAL will be a powerful tool in the study of cell morphology and related processes across diverse tissues and imaging modalities.

G-026: Integrate single cell multi-omics data to robustly identify predictive epi-genes for Staph infection
COSI: GenCompBio
  • Yuan Wang, Princeton University, United States
  • Wan Sze Cheng, Icahn School of Medicine at Mount Sinai, United States
  • Frederique Ruf-Zamojski, Icahn School of Medicine at Mount Sinai, United States
  • Antonio Cappuccio, Icahn School of Medicine at Mount Sinai, United States
  • Venugopalan Nair, Icahn School of Medicine at Mount Sinai, United States
  • Chris Woods, Duke University, United States
  • Elena Zaslavsky, Icahn School of Medicine at Mount Sinai, United States
  • Vance Fowler, Duke University, United States
  • Olga Troyanskaya, Princeton University; Simons Foundation, United States
  • Stuart Sealfon, Icahn School of Medicine at Mount Sinai, United States
  • Xi Chen, Simons Foundation, United States


Presentation Overview:

Extracting robust biological signals from single cell RNA-seq and ATAC-seq datasets on the same samples would benefit from improved methods for automated integration of data from heterogeneous samples and for leveraging related expression and chromatin changes to improve robust identification of regulatory changes. Towards this end, we describe two complementary methods, PISCES (Practical and Integrative Single-CEll analySis) and MAGICAL (Multiome Accessibility Gene Integration Calling And Looping), that we apply to the clinically relevant problem of rapid characterization of Staphylococcus aureus infections. PISCES describes a reliable integration framework to remove artifacts from heterogeneous samples under multiple batches and conditions, while preserving inter-cell type differences. We validated PISCES integration by high correspondence of cell type proportions with mass cytometry cell type identification on the same samples. The MAGICAL framework combines the integrated single-cell multi-omics data with additional DNA sequence and 3D chromatin information, and for each cell type, builds high-resolution 3D mappings between regulatory ATAC peaks and epi-genes by modeling the coordinated chromatin activity and gene expression changes associated with infections. We show that MAGICAL-selected epi-genes are dramatically more predictive for infections than using differentially expressed genes alone.

G-027: Distinct cell-type associations in Alzheimer’s Disease genetic studies
COSI: GenCompBio
  • Qiliang Lai, Department of Computer Science, Rice University, United States
  • Ruth Dannenfelser, Department of Computer Science, Rice University, United States
  • Jean-Pierre Roussarie, Department of Anatomy & Neurobiology, Boston University School of Medicine, United States
  • Vicky Yao, Department of Computer Science, Rice University, United States


Presentation Overview:

Alzheimer's disease (AD), accounting for more than 60% of dementia cases, is one of the major causes of death among the aged. Genome-wide association studies (GWAS) have suggested thousands of loci to be related, but other than microglia (a type of non-neuronal cell that helps maintain immune function in the brain), relatively few significant cell-type-specific associations have been identified, which is at odds with the well-characterized selective neuronal vulnerability in AD. Here, we explored cell specificity using both human and mouse scRNA-seq data in several AD GWAS with a generalized linear regression framework. Importantly, the GWAS we explored span a range of pathological or clinical endpoints. We recapitulate the previous discovery that clinical GWAS, comprising thousands of patients with various pathological phenotypes, are primarily associated with microglia and more generally immune-related cell types. Interestingly, for GWAS based on pathological endpoints, we observe statistical associations with neurons, and furthermore, these neuron types are also associated with early neurodegeneration in AD. Our findings highlight the heterogeneous, complex nature of AD and the broad challenges of modeling cell-type-based etiology with genetic studies.

G-028: B-assembler: a circular bacterial genome assembler
COSI: GenCompBio
  • Fengyuan Huang, University of Alabama at Birmingham, United States
  • Min Gao, University of Alabama at Birmingham, United States
  • Li Xiao, University of Alabama at Birmingham, United States
  • Ethan J Vallely, University of Alabama at Birmingham, United States
  • Kevin Dybvig, University of Alabama at Birmingham, United States
  • Thomas P Atkinson, University of Alabama at Birmingham, United States
  • Ken B Waites, University of Alabama at Birmingham, United States
  • Zechen Chong, University of Alabama at Birmingham, United States


Presentation Overview:

Accurate de novo assembly of bacterial genomes is fundamental to understanding the evolution and pathogenesis of bacteria. The advent and popularity of Third-Generation Sequencing (TGS) enables assembly of bacterial genomes. However, most current TGS assemblers were specifically designed for human or other species with linear genomes. In addition, the repetitive DNA fragments in many bacterial genomes, together with the high error rate of long-read sequencing data, make it very challenging to assemble these genomes accurately despite their small size. Therefore, an optimized method is urgently needed to address these issues. We developed B-assembler, which takes advantage of the structural resolving power of long reads and, where available, the accuracy of short reads. It first selects and corrects the ultra-long reads to get an initial contig. Then, it collects the reads overlapping with the ends of the initial contig. This two-round assembly procedure, along with optimized error correction, enables a high-confidence, circularized genome assembly. Benchmarked on both synthetic and real sequencing data from several bacterial species, both the long-read-only and hybrid-read modes accurately assemble circular bacterial genomes free of structural errors, with fewer small errors than other assemblers.

G-029: High resolution chromatin loop mapping from sparse Hi-C data based on deep learning
COSI: GenCompBio
  • Shanshan Zhang, Case Western Reserve University, United States
  • Dylan Plummer, Case Western Reserve University, United States
  • Fulai Jin, Case Western Reserve University, United States
  • Jing Li, Case Western Reserve University, United States


Presentation Overview:

Mapping chromatin loops from noisy Hi-C heatmaps remains a major challenge. Here we present DeepLoop, which performs rigorous bias-correction followed by deep-learning-based signal-enhancement for robust chromatin interaction mapping from low-depth Hi-C data. DeepLoop enables loop-resolution single-cell Hi-C analysis. It also achieves a cross-platform convergence between different Hi-C protocols and micro-C. DeepLoop allowed us to map the genetic and epigenetic determinants of allele-specific (AS) chromatin interactions in the human genome. We nominate new loci with AS-interactions governed by imprinting or allelic DNA methylation. We also discovered that in the inactivated X chromosome (Xi), local loops at the DXZ4 “megadomain” boundary escape X-inactivation, but the FIRRE “superloop” locus does not escape. Importantly, DeepLoop can pinpoint heterozygous SNPs and large structural variants (SVs) that cause allelic chromatin loops, many of which rewire enhancers with transcriptional consequences. Taken together, DeepLoop expands the use of Hi-C to provide loop-resolution insights into the genetics of the 3D genome.

G-030: Supercomputing-aided analysis of docking property between SARS-CoV-2 and host variations of hACE2 from a large cohort
COSI: GenCompBio
  • Hyojung Paik, KISTI, South Korea
  • Jimin Kim, KISTI, South Korea
  • Sangjae Seo, KISTI, South Korea


Presentation Overview:

The recent novel coronavirus disease (COVID-19) outbreak, caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), is threatening global health. However, understanding of the interaction of SARS-CoV-2 with human cells, including how physical docking properties vary with the host's genetic diversity, is still lacking. Here, based on germline variations in the UK Biobank covering 502,543 individuals, we unraveled the molecular interactions between human angiotensin converting enzyme 2 (hACE2), which is the representative receptor for SARS-CoV-2 entry, and COVID-19 infection. We identified six nonsense and missense variants of hACE2 from 2,585 subjects in the UK Biobank, which covers over 500,000 individuals. In our molecular dynamics simulations, three hACE2 variants found among the 2,585 individuals we selected showed higher binding free energy for docking, in the range of 1.44 to 3.69 kcal/mol. Although there are diverse contributors to SARS-CoV-2 infection, including the mobility of individuals, we analyzed the diagnosis records of individuals with those three variants of hACE2. Our molecular dynamics simulations, combined with population-based genomic data, provided us with an atomistic understanding of the interaction between hACE2 and the spike protein of SARS-CoV-2.

G-031: Single cell Correlation Analysis: Identifying a self-renewing subpopulation of human Leukemia stem cells using single-cell RNA-seq analysis
COSI: GenCompBio
  • Yoonkyu Lee, University of Minnesota Twin Cities, United States
  • Wen Wang, University of Minnesota Twin Cities, United States
  • Timothy K Starr, University of Minnesota Twin Cities, United States
  • Klara E Noble-Orcutt, University of Minnesota Twin Cities, United States
  • Chad L Myers, University of Minnesota Twin Cities, United States
  • Zohar Sachs, University of Minnesota Twin Cities, United States


Presentation Overview:

Acute myeloid leukemia (AML) is a rapidly fatal blood cancer. Leukemia stem cells (LSCs), the subpopulation of leukemia cells with self-renewal capacity, cause relapse and death in AML. Our goal is to identify the molecular mechanisms of self-renewal in LSCs in order to define therapeutic targets that prevent AML relapse. Previously, we performed single-cell RNA sequencing (scRNAseq) of murine LSCs, coupled with in vivo leukemia reconstitution assays to define and experimentally validate the single cell gene expression profile (scGEP) of self-renewal.
Here, we develop a novel method to investigate transcriptional mechanisms of self-renewal in primary human AML samples. We designed Single cell Correlation Analysis (SCA) to determine whether human LSCs express the self-renewal scGEP that we defined in murine AML. SCA employs a rigorous statistical framework to define similarity to a reference profile in a query scGE dataset. We used SCA to analyze scRNAseq data from primary human AML LSC samples and we discovered that human AML samples harbor cells that express the murine self-renewal scGEP. SCA identified putative human self-renewing LSCs at the single-cell level. SCA is a novel algorithm that identifies cells of interest within a single-cell transcriptional data set with statistical confidence.

G-032: Expression-Driven Dependency Analyses Identify New Cancer Vulnerabilities
COSI: GenCompBio
  • Abdulkadir Elmas, Icahn School of Medicine at Mount Sinai, United States
  • Kuan-Lin Huang, Icahn School of Medicine at Mount Sinai, United States


Presentation Overview:

Cancer cells harboring different molecular aberrations respond differently to genetic knockdown/knockouts and show different genetic dependencies. Precision therapeutic targets may be identified through expression-driven dependency, whereby cancer cells with high expression of the targeted genes show lower viability and are more vulnerable to the genetic knockdown/knockout. Here, we developed a Bayesian approach to jointly analyze global proteomic and transcriptomic profiles and genetic dependencies of 375 cancer cell lines across various tissue types from the Cancer Dependency Map (DepMap). We identified actionable targets validating known drug-gene relationships in drug-gene databases, e.g., SOX10 and ESR1 in skin and breast cancers, respectively. Meanwhile, we also revealed new treatment candidates for each cancer type, including IRF4 and PAX proteins in skin cancer and TP63 protein in lung cancer. Within each cell lineage, mRNA/protein-based expression-driven dependencies showed high concordance, and we further demonstrated that the identified proteins showed 3- to 5-fold enrichment for known drug targets. In conclusion, by applying a Bayesian approach to determine expression-driven dependency, our analyses effectively highlighted precision treatment targets and drugs for cancer cells across diverse lineages.

G-033: The role of cell-cell communication in Alzheimer’s disease
COSI: GenCompBio
  • Tabea M. Soelter, The University of Alabama at Birmingham, United States
  • Vishal H. Oza, The University of Alabama at Birmingham, United States
  • T.C. Howton, The University of Alabama at Birmingham, United States
  • Brittany N. Lasseigne, The University of Alabama at Birmingham, United States


Presentation Overview:

Although neuronal loss is a primary hallmark of Alzheimer’s disease (AD), it is known that non-neuronal cell populations maintain brain homeostasis and neuronal health. Non-neuronal cell populations are comprised of microglia (e.g. macrophages) and macroglia (e.g. astrocytes and oligodendrocytes). Astrocytes provide metabolic and nutritional support to neurons, while oligodendrocytes are responsible for the myelination of the axons of neurons. Neuron-glia and glial cell crosstalk via chemical messengers enables normal cognitive function, as neuronal health and functionality is maintained. Reactive microglia have been implicated in activation of reactive astrocytes, which are known to be neurotoxic and lead to neuronal degeneration. Yet the causal mechanisms of altered neuron-glia interactions underlying neurodegenerative diseases like AD are not fully understood. snRNA-seq can address cell-type-specific gene expression changes and allows the application of in silico methodologies that predict ligand-receptor interactions and infer cell-cell communication (e.g., NicheNet). Using publicly available post-mortem human AD and control brain snRNA-seq data, we identify ligand-receptor pairs in AD. We find cell-type-specific expression of ligands across specified sender cell populations (astrocytes, microglia, oligodendrocytes, and OPCs) and receiver cells (neurons) as well as pathways with differential expression between AD and control brain samples.

G-034: Saltwater cell factories enabled by synthetic biology in the extremophile oleaginous yeasts Debaryomyces hansenii
COSI: GenCompBio
  • Sarah Weintraub, Worcester Polytechnic Institute, United States
  • Zekun Li, Worcester Polytechnic Institute, United States
  • Eric Young, Worcester Polytechnic Institute, United States


Presentation Overview:

Cell factories for biofuels could be a renewable and environmentally friendly alternative to fossil fuels. However, the scale needed for production of fuels is impossible with current cell factories based on model organisms that do not consume inexpensive biomass, are not tolerant to harsh fermentation conditions, and use a lot of freshwater. We have identified Debaryomyces yeasts as unique candidates for fuel production because they efficiently consume depolymerized lignocellulosic biomass, produce fatty acids, and are extremophiles – tolerant to both production conditions and saltwater. Here, we present genomic and transcriptomic analysis of relevant Debaryomyces phenotypes under nitrogen starvation, iron deprivation, and salt stress. We compare the transcriptomic response to two current yeast cell factories – Saccharomyces cerevisiae and Yarrowia lipolytica. We find that unlike Yarrowia, Debaryomyces does not increase its fatty acid production under nitrogen starvation. We also find that Debaryomyces yeasts rewire over a third of the transcriptome in response to salt stress. This includes fatty acid production, membrane integrity, and membrane transport processes that indicate a broad pleiotropic response to ionic stress. In summary, this work sheds light on the genetic basis of extremophile yeast phenotypes and provides targets for future metabolic engineering efforts to develop cell factories.

G-035: Master Regulators of Protein Abundance across 6 Cancer Types
COSI: GenCompBio
  • Zishan Wang, Icahn School of Medicine at Mount Sinai, United States
  • Abdulkadir Elmas, Icahn School of Medicine at Mount Sinai, United States
  • Kuan-Lin Huang, Icahn School of Medicine at Mount Sinai, United States


Presentation Overview:

Translation regulation is a critical step in the transmission of genetic information from mRNA to functional proteins. However, the mechanisms underlying translation regulation have not yet been systematically explored in cancer. Here, we inferred thousands of translation regulators (TRs) via integration of genome-wide paired mRNA and protein expression profiles across 6 cancer cohorts. The TRs were significantly enriched for known translation-related factors, such as RNA binding proteins (RBPs) and nuclear pore complexes, and played roles in spliceosome and protein export processes. Our systematic prediction of TRs provides a valuable resource for explaining translation regulation in cancer.

G-036: Gradients of Gene Expression in the White/Gray Matter Interface of the Brain Cortex
COSI: GenCompBio
  • Oscar Ospina, Moffitt Cancer Center, United States
  • Inna Smalley, Moffitt Cancer Center, United States
  • Brooke Fridley, Moffitt Cancer Center, United States


Presentation Overview:

Most spatial transcriptomic (ST) studies do not fully leverage the spatial information. Often, researchers aim to discern patterns of gene expression at the interface of two tissue domains (e.g., cortex/white matter), achieved through clustering and differential gene expression analysis. To leverage the spatial information in the detection of interface gene expression patterns, we devised an approach to identify gene expression correlations with spatial distance. The method calculates distances from each spot to a tissue feature, then correlates distances to the expression of each gene. We tested the utility of our new method in brain cortex profiled using 10X Visium. As expected, genes involved in myelination such as MBP (r=-0.53, p=0.00) and PLP1 (r=-0.40, p=0.00) showed negative correlations with distance from the white matter, indicative of the white/gray matter oligodendrocyte gradient. Accordingly, the GO term associated with biosynthesis of peptides was enriched in this gradient. Other GO terms enriched by genes correlated with distance included regulation of synaptic plasticity and neuron projection development. In summary, we present an approach that will enable researchers to discover functional processes and signaling pathways that are spatially dependent.
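
The two steps of the method (distance from each spot to a tissue feature, then correlation of distance with expression) can be sketched as follows. This is an illustrative sketch only: it assumes Euclidean distance to the nearest feature spot and Pearson correlation, neither of which is specified in the abstract, and all names and numbers are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def distance_to_feature(spots: Sequence[Point], feature: Sequence[Point]) -> List[float]:
    """Euclidean distance from each spot to its nearest feature spot (assumed metric)."""
    return [
        min(math.hypot(sx - fx, sy - fy) for fx, fy in feature)
        for sx, sy in spots
    ]


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input has zero variance."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def correlate_genes_with_distance(
    distances: Sequence[float], expression: Dict[str, Sequence[float]]
) -> Dict[str, float]:
    """Correlate each gene's per-spot expression with the per-spot distance."""
    return {gene: pearson(distances, values) for gene, values in expression.items()}


# Hypothetical toy data: four spots on a line, feature located at the first spot.
spots = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
feature = [(0.0, 0.0)]
expression = {
    "decreasing_gene": [9.0, 7.0, 5.0, 3.0],
    "increasing_gene": [1.0, 2.0, 3.0, 4.0],
    "flat_gene": [2.0, 2.0, 2.0, 2.0],
}
distances = distance_to_feature(spots, feature)
correlations = correlate_genes_with_distance(distances, expression)
```

A negative coefficient, as for `decreasing_gene` here, corresponds to expression that falls with distance from the feature, which is the pattern reported for MBP and PLP1 relative to the white matter.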

G-037: AcrFinder & AcaFinder: genome mining anti-CRISPR operons & anti-CRISPR associated proteins in prokaryotes and their viruses
COSI: GenCompBio
  • Yanbin Yin, University of Nebraska-Lincoln, United States
  • Bowen Yang, University of Nebraska-Lincoln, United States
  • Haidong Yi, University of North Carolina at Chapel Hill, United States
  • Jinfang Zheng, University of Nebraska-Lincoln, United States


Presentation Overview:

Anti-CRISPR (Acr) and anti-CRISPR associated (Aca) proteins encoded by (pro)phages/(pro)viruses have great potential to enable more controllable genome editing. Here we present AcrFinder and AcaFinder. AcrFinder is a web server (http://bcb.unl.edu/AcrFinder) designed for Acr screening. The tool has the following unique features: (i) it is the first online server specifically mining genomes for Acr-Aca operons; (ii) it provides a most comprehensive Acr and Aca (Acr-associated regulator) database; (iii) it combines homology-based, GBA-based, and self-targeting approaches in one software package; and (iv) it provides a user-friendly web interface. AcrFinder had a 100% recall in validation tests. As the first ever standalone tool and web server for scanning Aca proteins, AcaFinder has a recall of 92%. Functional features of AcaFinder include: (i) identifying both potential Acas and their associated Acr-Aca operons based on GBA; (ii) identifying Aca-like proteins using built-in Aca HMMs; (iii) providing potential prophage regions, CRISPR-Cas systems, and STSS within the user's input genomic data; and (iv) providing a user-friendly web interface that generates graphical representations of identified Aca/Acr-Aca operons with associated CRISPR-Cas, prophage, and STSS information in their genomic context. Both AcrFinder and AcaFinder will be valuable resources for the anti-CRISPR & genome editing research community.

G-038: QuaC Pipeline Enables Consistent and Standardized Quality Control of Genome and Exome Sequencing Data
COSI: GenCompBio
  • Manavalan Gajapathy, University of Alabama at Birmingham, United States
  • Brandon Wilk, University of Alabama at Birmingham, United States
  • Donna Brown, University of Alabama at Birmingham, United States
  • Elizabeth Worthey, University of Alabama at Birmingham, United States


Presentation Overview:

Quality control (QC) is an essential component when utilizing Genome and Exome sequencing data to ensure that they are of sufficiently good quality to use with the experiments planned. Performing QC, however, is complicated by difficulties both in running the tools and in interpreting their results in a consistent manner. To address this, we developed a QC pipeline called QuaC using Snakemake and Python. QuaC isolates each job execution in a Singularity container environment and includes system testing to assist with reproducibility and portability. At our center, QuaC serves as a companion pipeline to the small variant caller pipeline, which accepts FASTQs as input and produces BAM and small variant VCF as output. QuaC performs three major tasks while integrating and standardizing QC best practices at our center: (1) Executes several QC tools at the project level using BAM and VCF files as input and further utilizes FASTQ-based QC results produced by the upstream small variant caller pipeline; (2) Summarizes whether samples have passed the user-configurable QC thresholds using a tool called QuaC-Watch, thereby enabling consistency; (3) Aggregates QC results, including QuaC-Watch results, into a MultiQC report, at both the sample and project level.
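
The idea of summarizing samples against user-configurable thresholds, as in task (2), can be illustrated with a small sketch. It is not QuaC-Watch code and does not reflect its configuration format; the metric names, the (minimum, maximum) threshold layout, and the rule that a missing metric counts as a failure are all assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Hypothetical threshold layout: metric name -> (minimum, maximum), None = unbounded.
Thresholds = Dict[str, Tuple[Optional[float], Optional[float]]]


def check_sample(metrics: Dict[str, float], thresholds: Thresholds) -> Dict[str, bool]:
    """Return a pass/fail flag per configured metric for one sample."""
    results: Dict[str, bool] = {}
    for name, (low, high) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            # Assumption: a metric that was not reported is treated as a failure.
            results[name] = False
            continue
        above_min = low is None or value >= low
        below_max = high is None or value <= high
        results[name] = above_min and below_max
    return results


def sample_passes(metrics: Dict[str, float], thresholds: Thresholds) -> bool:
    """A sample passes only if every configured metric is within its bounds."""
    return all(check_sample(metrics, thresholds).values())


# Hypothetical thresholds and metrics, for illustration only.
thresholds: Thresholds = {
    "mean_coverage": (30.0, None),
    "contamination": (None, 0.02),
}
sample_a = {"mean_coverage": 35.2, "contamination": 0.05}
report_a = check_sample(sample_a, thresholds)
```

Keeping the thresholds in a plain data structure means the same check can be applied uniformly to every sample, which is the consistency benefit the abstract describes.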

G-039: A Comprehensive Compendium of Breast Cancer Gene-Expression Datasets
COSI: GenCompBio
  • Ifeanyichukwu Nwosu, Brigham Young University, United States
  • Stephen Piccolo, Brigham Young University, United States


Presentation Overview:

To make biological interpretations from transcriptomic data, a researcher must download the data, perform quality checks, and clean and standardize the data. The researcher must then derive statistical conclusions from the data. Although it is possible to access many available data sets to address such questions, it is difficult for researchers, especially those with limited computational expertise, to perform these processing steps and then make sound interpretations of the data.

We have curated 70 publicly available breast cancer datasets representing 15,137 breast cancer patients, uniformly processed them, and standardized the metadata variables against the National Cancer Institute Thesaurus, a popular standard with unique codes for biomedical terms and ideas. This is useful because it makes it easier to infer mappings when researchers want to combine datasets. These curated datasets have a wide range of metadata variables, such as hormone receptor status, race, disease stage, and tumor size. This curated data will be freely available for other researchers to analyze. We believe that having this resource together in one place will minimize time spent on data manipulation, allowing researchers to focus on answering biomedical questions rather than on developing computational pipelines to process the data, thus potentially accelerating biomedical research.

G-040: Resolving Network Clusters Disparity Based on Dissimilarity Measurements with Non-metric Analysis of Variance
COSI: GenCompBio
  • Alina Malyutina, University of Helsinki, Finland
  • Jing Tang, University of Helsinki, Finland
  • Ali Amiryousefi, University of Helsinki, Finland


Presentation Overview:

The Nonmetric Analysis of Variance (nmANOVA) provides a framework for an ANOVA-type analysis in cases where proper metric measurements between objects are lost, unknown, or otherwise inaccessible. While classic ANOVA is based on measurements of the data from a base datum, nmANOVA is formulated on the dissimilarity outputs (not necessarily metric) defined between all objects. Just as the main goal of ANOVA is to provide a statistical test for assessing the significance of a considered partitioning of the data, nmANOVA yields a parallel scheme of inference by 1) arranging the resulting dissimilarities into within-group and between-group statistics, 2) assessing their respective divergence with a parametric distribution, and 3) providing a resultant p-value indicating the evidence for rejecting the null hypothesis.
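
Step 1, splitting pairwise dissimilarities into within-group and between-group parts, can be sketched as below. The abstract does not give the actual nmANOVA statistics or the parametric distribution used in steps 2 and 3, so this sketch only computes simple mean dissimilarities as a stand-in; the function name and the toy matrix are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple


def within_between_means(
    diss: Sequence[Sequence[float]], labels: Sequence[str]
) -> Tuple[float, float]:
    """Mean pairwise dissimilarity within groups and between groups.

    diss is a symmetric matrix of pairwise dissimilarities (not necessarily
    metric); labels assigns each object to a group. Illustrative stand-in
    only, not the nmANOVA test statistic.
    """
    if len(diss) != len(labels):
        raise ValueError("matrix size and number of labels must agree")
    within: List[float] = []
    between: List[float] = []
    for i, j in combinations(range(len(labels)), 2):
        target = within if labels[i] == labels[j] else between
        target.append(diss[i][j])
    if not within or not between:
        raise ValueError("need at least one within-group and one between-group pair")
    return sum(within) / len(within), sum(between) / len(between)


# Hypothetical toy example: two groups of two objects each.
labels = ["a", "a", "b", "b"]
diss = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 4.0, 5.0],
    [4.0, 4.0, 0.0, 1.0],
    [5.0, 5.0, 1.0, 0.0],
]
mean_within, mean_between = within_between_means(diss, labels)
```

Because only pairwise values are used, nothing here requires the dissimilarity to satisfy the triangle inequality, which matches the non-metric setting the abstract targets.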

G-041: dbCAN-profiler: automated carbohydrate-active enzyme annotation using raw sequence reads
COSI: GenCompBio
  • Yanbin Yin, University of Nebraska-Lincoln, United States
  • Jinfang Zheng, University of Nebraska-Lincoln, United States
  • Qiwei Ge, University of Nebraska-Lincoln, United States


Presentation Overview:

Since 2012, dbCAN has become the most popular bioinformatics tool for automated Carbohydrate-active enzyme (CAZyme) annotation in microbiome research. Currently, dbCAN only allows assembled genomes as the input for CAZyme annotation. With cheaper DNA sequencing, it is now easy to obtain a massive amount of sequence reads from a large number of metagenomic DNA/RNA samples. There is an urgent need among microbiome researchers to predict CAZymes using raw reads without assembling them into contigs, which is very time-consuming and error-prone. This demands an assembly-free method to annotate CAZymes, which can also quantify CAZyme abundance across multiple samples. We are developing dbCAN-profiler, software that allows users to input raw reads from multiple microbiome samples. The output will be the occurrence and abundance of CAZymes in different samples. Our preliminary results indicate that the assembly-free method (i) is much faster in runtime, (ii) has better accuracy for CAZyme occurrence prediction when the sequencing coverage is low, but (iii) is generally less accurate in CAZyme abundance prediction than the assembly-based method.